Research

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

OmniGameArena:面向VLM游戏智能体的统一UE5基准与改善动态

arXiv logo

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

arXiv.org

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.

Open source

Recommended because

This is worth tracking because it is a concrete research signal, not just a passing headline. The source preview points to a research result, method, evaluation, dataset, or safety finding. For builders and operators, "OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics" can be used as a checkpoint for technical due diligence, roadmap bets, agent design, and evaluation strategy. I keep this thread indexed so future searches around AI research papers, technical methods, and applied AI systems can land on a source-linked page instead of disappearing into a fast-moving feed from arXiv.org.

What to take from this signal

Context

"OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics" is archived here as a source-linked AI signal from arXiv.org. The useful part is the connection between OmniGameArena, Unified, UE5, Benchmark, VLM and technical due diligence, roadmap bets, agent design, and evaluation strategy, which makes the item more actionable than a normal feed headline. The source context says: Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.

Builder takeaway

For an AI builder, the main takeaway is to watch how this signal changes practical decisions around technical feasibility, evaluation design, safety limits, and product primitives. It can inform what to test next, which product surface to compare, and whether the underlying workflow is ready for real users.

Source context

arXiv.org remains the authoritative source for the original claim. This page adds a stable archive URL, a short builder interpretation, and related search language so the item can be found later when the original feed has moved on.

Search angles

  • OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics Research context
  • arXiv.org AI research
  • OmniGameArena, Unified, UE5, Benchmark, VLM builder takeaway
  • AI research papers, technical methods, and applied AI systems

This page keeps a source preview and a stable archive URL for search discovery. The original source remains authoritative.