An automated scanning agent has revealed that several prominent AI benchmarks can be easily exploited to achieve high scores without actually completing tasks. This raises concerns about the reliability of benchmark scores, which are often used to evaluate AI capabilities and influence investment decisions.
An automated agent audited eight major AI benchmarks and found that all could be exploited for inflated scores without solving tasks.
The benchmarks audited include SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench.
Exploits included using existing code vulnerabilities, manipulating evaluation environments, and leveraging public answer repositories.
For instance, SWE-bench was found to allow agents to manipulate test results by introducing code that falsely reported passing outcomes.
WebArena allowed agents to access reference answers directly through file system navigation, leading to correct answers without actual problem-solving.
FieldWorkArena's validation method only checked if the last message came from the assistant, ignoring content correctness, allowing perfect scores without valid responses.
OSWorld's evaluation method permitted agents to download gold reference files directly, leading to perfect matches without real task execution.
The findings indicate systemic issues with benchmark design, as many rely on shared environments that can be manipulated by the agents being tested.
These vulnerabilities could lead to inflated benchmark scores that do not accurately reflect AI capabilities, raising concerns for developers, investors, and researchers relying on these metrics.