AI Research
Mar 24, 2026

AI Agents Show Improved Capabilities but Struggle with Reliability

AI Summary

Recent evaluations show that AI agents are becoming more capable, but their reliability remains a significant concern. A study assessing a range of AI models found that improvements in reliability lagged behind gains in accuracy, raising questions about deploying these models in critical applications.

  • AI agents are increasingly used for tasks like research, but their performance is inconsistent.
  • A recent study by researchers from Princeton University examined the reliability of AI agents, focusing on four dimensions: consistency, robustness, calibration, and safety.
  • The study tested models released in the 18 months prior to late November 2025, including OpenAI's GPT-5.2 and Anthropic's Claude Opus 4.5.
  • Claude Opus 4.5 and Google's Gemini 3 Pro achieved the highest reliability score, 85%, but still showed weaknesses in specific areas, such as judging their own accuracy and avoiding catastrophic mistakes.
  • The researchers emphasized the need for reliability benchmarks in AI systems, especially for applications requiring automation, where unpredictability can lead to significant risks.
  • An example from the healthcare sector illustrated that combining several AI tools, each with high individual accuracy, can result in low overall reliability, potentially leading to misdiagnoses.
  • The study calls for AI developers to prioritize reliability alongside capability in their systems.
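
The healthcare point above comes down to simple arithmetic: when several tools must all be right for the final result to be right, their accuracies multiply. The sketch below is only an illustration of that effect, not the study's method. It assumes the tools' errors are independent, and the five 95% figures and the function name chain_reliability are hypothetical, chosen for the example rather than taken from the study.

```python
from math import prod


def chain_reliability(step_accuracies):
    """Probability that every step is correct, assuming independent errors.

    This independence assumption is a simplification made for illustration;
    real tools may fail in correlated ways.
    """
    return prod(step_accuracies)


# Hypothetical figures (not from the study): five tools, each 95% accurate.
accuracies = [0.95, 0.95, 0.95, 0.95, 0.95]
overall = chain_reliability(accuracies)

print(f"Per-tool accuracy: {accuracies[0]:.0%}")
print(f"End-to-end reliability: {overall:.1%}")  # prints "End-to-end reliability: 77.4%"
```

Under these assumed numbers, a pipeline built from tools that are each right 95% of the time produces a fully correct result only about 77% of the time, which is the kind of gap between individual accuracy and overall reliability the example describes.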
ai agents, reliability, benchmarking, princeton, ai vendors