AI Agents Show Improved Capabilities but Struggle with Reliability

Mar 24, 2026

AI Summary

Recent evaluations show that while AI agents are becoming more capable, their reliability remains a significant concern. A study assessing various AI models found that improvements in reliability lagged behind advancements in accuracy, raising questions about deploying these agents in critical applications.

AI agents are increasingly used for tasks like research, but their performance is inconsistent.
A recent study by researchers from Princeton University examined the reliability of AI agents, focusing on four dimensions: consistency, robustness, calibration, and safety.
The study tested models released in the 18 months prior to late November 2025, including OpenAI's GPT-5.2 and Anthropic's Claude Opus 4.5.
Claude Opus 4.5 and Google’s Gemini 3 Pro scored highest on overall reliability, at 85%, but still showed weaknesses in specific areas, such as judging the accuracy of their own answers and avoiding catastrophic mistakes.
The researchers emphasized the need for reliability benchmarks in AI systems, especially for applications requiring automation, where unpredictability can lead to significant risks.
An example from the healthcare sector illustrated that combining AI tools that are each highly accurate can still result in low overall reliability, potentially leading to misdiagnoses.
The study calls for AI developers to prioritize reliability alongside capability in their systems.
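
The healthcare example above comes down to simple arithmetic: when several tools must all be right for the final result to be right, their individual accuracies multiply. The following is a minimal sketch of that effect, not the study's method. It assumes each tool's errors are independent, and the 95% figures and function names are hypothetical values chosen for illustration, not numbers from the study.

```python
from math import prod


def pipeline_reliability(step_accuracies):
    """Probability that every step is correct, assuming independent errors."""
    for accuracy in step_accuracies:
        if not 0.0 <= accuracy <= 1.0:
            raise ValueError("each accuracy must be between 0 and 1")
    return prod(step_accuracies)


def steps_until_below(accuracy, threshold):
    """Smallest number of equally accurate, independent steps whose
    combined reliability falls below the given threshold."""
    if not (0.0 < accuracy < 1.0 and 0.0 < threshold <= 1.0):
        raise ValueError("need 0 < accuracy < 1 and 0 < threshold <= 1")
    steps = 1
    while accuracy ** steps >= threshold:
        steps += 1
    return steps


# Hypothetical pipeline: five tools, each 95% accurate on its own.
tools = [0.95, 0.95, 0.95, 0.95, 0.95]
print(f"Overall reliability: {pipeline_reliability(tools):.3f}")  # 0.774

# How many such tools can be chained before the pipeline is right
# less than half the time?
print(steps_until_below(0.95, 0.5))  # 14
```

Under these assumptions, five tools that are each 95% accurate produce a fully correct result only about 77% of the time. Real tools may have correlated errors, in which case the combined figure would differ.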

ai agents, reliability, benchmarking, princeton, ai vendors
