Research
Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
用对抗性黑客-修补循环强化Agent基准测试
Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
arXiv.orgAgent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access, and let patches transfer across tasks, to broaden the exploits the loop discovers. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash's loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7's attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro's from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.
Open sourceRecommended because
This is worth tracking because it is a concrete research signal, not just a passing headline. The source preview points to a research result, method, evaluation, dataset, or safety finding. For builders and operators, "Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops" can be used as a checkpoint for technical due diligence, roadmap bets, agent design, and evaluation strategy. I keep this thread indexed so future searches around AI research papers, technical methods, and applied AI systems can land on a source-linked page instead of disappearing into a fast-moving feed from arXiv.org.
What to take from this signal
Context
"Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops" is archived here as a source-linked AI signal from arXiv.org. The useful part is the connection between Hardening, Agent, Benchmarks, Adversarial, Hacker-Fixer and technical due diligence, roadmap bets, agent design, and evaluation strategy, which makes the item more actionable than a normal feed headline. The source context says: Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access, and let patches transfer across tasks, to broaden the exploits the loop discovers. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash's loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7's attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro's from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.
Builder takeaway
For an AI builder, the main takeaway is to watch how this signal changes practical decisions around technical feasibility, evaluation design, safety limits, and product primitives. It can inform what to test next, which product surface to compare, and whether the underlying workflow is ready for real users.
Source context
arXiv.org remains the authoritative source for the original claim. This page adds a stable archive URL, a short builder interpretation, and related search language so the item can be found later when the original feed has moved on.
Search angles
- Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops Research context
- arXiv.org AI research
- Hardening, Agent, Benchmarks, Adversarial, Hacker-Fixer builder takeaway
- AI research papers, technical methods, and applied AI systems
This page keeps a source preview and a stable archive URL for search discovery. The original source remains authoritative.