Research

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

Center for Responsible, Decentralized Intelligence at Berkeley

rdi.berkeley.edu

Recommended because

This item is worth tracking because it is a concrete research result, not just a passing headline, and the original source lets you check the details behind it. For builders and operators, CyberGym-E2E can serve as a checkpoint for technical due diligence, roadmap bets, agent design, and evaluation strategy. I keep this thread indexed so that future searches on AI research papers, technical methods, and applied AI systems land on a source-linked page rather than losing the item in a fast-moving feed from rdi.berkeley.edu.

What to take from this signal

Context

"CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities" is archived here as a source-linked AI signal from rdi.berkeley.edu. The useful part is the connection between CyberGym-E2E, Scalable, Real-World, Benchmark, Agents and technical due diligence, roadmap bets, agent design, and evaluation strategy, which makes the item more actionable than a normal feed headline. The source context says: CyberGym-E2E 是一个包含920个真实漏洞、覆盖139个开源项目的大规模端到端网络安全基准。任务要求AI智能体在真实代码库中自行定位漏洞、生成触发崩溃的概念验证并编写补丁。测试表明:若直接给出漏洞位置,最强配置可修复约80%漏洞;但若需自行发现,端到端成功率急剧下降--Claude Opus 4.5仅19.2%,最新模型在37%-66%之间。智能体可能发现替代漏洞,且存在部分浅层补丁。所有漏洞已事先公开披露并修复。

Builder takeaway

For an AI builder, the main takeaway is to watch how this result changes practical decisions about technical feasibility, evaluation design, safety limits, and product primitives. It can inform what to test next, which product surface to compare against, and whether the underlying workflow is ready for real users.

Source context

rdi.berkeley.edu remains the authoritative source for the original claim. This page adds a stable archive URL, a short builder interpretation, and related search language so the item can be found later when the original feed has moved on.

Search angles

  • CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities Research context
  • rdi.berkeley.edu AI research
  • CyberGym-E2E, Scalable, Real-World, Benchmark, Agents builder takeaway
  • AI research papers, technical methods, and applied AI systems
