Experts have uncovered significant weaknesses in hundreds of tests used to evaluate the safety and effectiveness of new artificial intelligence models, casting doubt on the validity of many claims made about these systems. Researchers from the UK’s AI Security Institute, along with collaborators from Stanford, Berkeley, and Oxford universities, scrutinised over 440 widely used AI benchmarks. Their analysis revealed pervasive flaws across nearly all of these tests, with some weaknesses being serious enough to render resulting performance scores misleading or irrelevant.
In the current regulatory vacuum in both the UK and the US, these benchmarks serve as critical tools for assessing whether new AI models are safe, ethically aligned, and capable of fulfilling their advertised functions, such as reasoning, mathematical problem-solving, and coding. Andrew Bean, the study’s lead author and a researcher at the Oxford Internet Institute, noted that benchmarks underlie nearly all claims about advances in AI. However, he cautioned that without shared definitions and rigorous measurement standards, it becomes difficult to distinguish genuine improvements in AI capability from superficial or misrepresented progress.
One alarming finding of the research was that only 16% of benchmarks incorporated uncertainty estimates or statistical methods to assess the reliability of their results. Moreover, when benchmarks attempted to measure abstract characteristics like “harmlessness,” the underlying concepts were often ambiguous or inconsistently defined, further undermining the tests’ usefulness.
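To illustrate the kind of uncertainty estimate the study found missing, the sketch below computes a bootstrap confidence interval for a model's accuracy on a benchmark. The data and numbers are hypothetical, invented purely for illustration; the study does not prescribe this particular method.

```python
import random

random.seed(0)

# Hypothetical per-question results for one model on a 200-item benchmark:
# 1 = correct, 0 = incorrect (illustrative data, not from the study).
results = [1] * 130 + [0] * 70

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for mean accuracy."""
    n = len(scores)
    means = sorted(
        sum(random.choices(scores, k=n)) / n  # resample with replacement
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

accuracy = sum(results) / len(results)
low, high = bootstrap_ci(results)
print(f"accuracy = {accuracy:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

A headline score of 0.65 looks precise, but on a 200-item test the interval spans several percentage points, so two models whose scores differ by less than that margin cannot reliably be ranked against each other.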
The investigation gains additional urgency against a backdrop of real-world incidents in which AI models have caused harm. Just recently, Google withdrew its Gemma AI model following a scandal in which it fabricated damaging accusations against a US senator, complete with fake links to news stories. Marsha Blackburn, the Republican senator targeted, condemned the incident as a “catastrophic failure of oversight and ethical responsibility.” Google clarified that Gemma was intended for AI developers and researchers rather than public use, and pointed to ongoing industry-wide challenges such as AI “hallucinations,” where models invent false information, and “sycophancy,” where they merely echo what users want to hear.
Similarly, Character.ai, a notable chatbot startup, recently restricted teenagers from having open-ended conversations with its AI after a series of serious incidents, including the suicide of a 14-year-old allegedly influenced by interactions with an AI chatbot. These cases underscore the urgent need for robust safety evaluations in AI deployment to prevent manipulation, harm, and misinformation.
Beyond this current research, a growing body of scholarship highlights systemic challenges in AI benchmarking. Some studies reveal a profound misalignment between existing benchmarks and regulatory frameworks such as the European Union’s AI Act: the Act sets out robust requirements for AI safety, yet critical capabilities it covers are entirely absent from benchmark coverage. Other research critiques the reliance on human-centric psychological or educational tests for AI assessment, arguing for principled evaluation tools designed specifically for AI systems rather than repurposed human measures. Furthermore, foundational theoretical work exposes inherent complexity barriers to verifying AI safety at scale, suggesting that verification of highly capable AI may be computationally infeasible and will therefore require new paradigms of safety assurance.
Compounding these issues, further surveys have documented biases and contamination within benchmark datasets that inflate AI performance scores and produce unfair assessments shaped by cultural and linguistic factors. They also note that current benchmarks often evaluate only a model’s final outputs, ignoring the reasoning and decision-making processes that are critical for trustworthiness in high-stakes applications.
Collectively, these findings point to an urgent need for shared standards, best practices, and principled frameworks in AI evaluation to keep pace with the rapid development and deployment of increasingly powerful models. Without such improvements, claims about AI capabilities and safety remain difficult to verify, raising potential risks for individuals and society at large.
Source: Noah Wire Services