AI Benchmarks Are Broken. Here’s How to Fix Them.
MMLU is saturated. HumanEval is contaminated. Most popular benchmarks have become optimization targets rather than measurement tools. When the benchmark becomes the goal, it ceases to measure what it was designed to measure. This is Goodhart’s Law applied to AI evaluation, and the field hasn’t reckoned with it.
What’s Wrong With Current Benchmarks
Saturation. Top models now score above 90% on MMLU, GSM8K, and most other standard benchmarks. At that ceiling the benchmark no longer differentiates: the remaining variance is noise, not signal. We end up comparing models on their ability to answer the handful of trick questions that remain rather than on their general capability.
Contamination. Benchmark datasets leak into training data. Sometimes deliberately, sometimes through web scraping that captures benchmark discussions. A model that has seen the test questions during training isn’t demonstrating capability — it’s demonstrating memory. And there’s no reliable way to detect contamination at scale.
Gaming. Organizations optimize for benchmarks because benchmarks drive adoption. This creates perverse incentives. A model specifically tuned to score well on MMLU may perform worse on real-world tasks that MMLU was supposed to predict. The benchmark becomes a Potemkin village of capability.
Missing dimensions. Current benchmarks test knowledge, reasoning, and code generation. They don’t test empathy, nuance, contextual sensitivity, or the ability to handle ambiguity. These “soft” capabilities are often more important for real-world usefulness than the “hard” capabilities benchmarks measure.
The Measurement Problem
Benchmarks fail because they try to reduce multi-dimensional capability to a single number. A model’s usefulness depends on dozens of factors that interact in complex ways. Reducing this to “scores 92.3 on MMLU” is like evaluating a chef by measuring the temperature of their food: it captures one dimension of quality and misses everything else that matters.
The fundamental problem: we’re measuring what’s easy to measure rather than what matters. Multiple-choice questions are easy to score. Open-ended quality is hard to score. So we use multiple-choice questions and pretend they measure open-ended quality. They don’t.
Principles for Better Benchmarks
Multi-dimensional evaluation. Don’t collapse quality into a single score. Evaluate models on independent dimensions: factual accuracy, reasoning depth, empathy, clarity, contextual sensitivity, uncertainty calibration. Report each dimension separately. A model that scores 95 on accuracy and 40 on empathy is very different from one that scores 75 on both, even if they average the same.
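Concretely, per-dimension reporting looks like this. A minimal sketch: the dimension names come from the list above, but the two models and their scores are entirely invented to show how identical averages can hide very different profiles.

```python
# Report each dimension separately instead of collapsing to one number.
# Models and scores below are illustrative, not real measurements.
DIMENSIONS = ["accuracy", "reasoning", "empathy", "clarity",
              "context_sensitivity", "calibration"]

def report(name, scores):
    """Print a per-dimension scorecard; the mean is shown only for reference."""
    print(name)
    for dim in DIMENSIONS:
        print(f"  {dim:20s} {scores[dim]:5.1f}")
    print(f"  {'(mean, reference)':20s} {sum(scores.values()) / len(scores):5.1f}")

model_a = dict(zip(DIMENSIONS, [95, 85, 40, 80, 60, 60]))  # spiky profile
model_b = dict(zip(DIMENSIONS, [75, 70, 75, 65, 70, 65]))  # flat profile

report("model_a", model_a)
report("model_b", model_b)
```

Both hypothetical models average 70.0, yet one would be unusable in any empathy-sensitive deployment. The single number erases exactly the information a buyer needs.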
Dynamic benchmarks. Static benchmarks get contaminated and gamed. Dynamic benchmarks generate new evaluation items regularly, making memorization impossible. This is harder to implement but necessary for meaningful evaluation.
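The core mechanism is simple: items are generated fresh from a seed at evaluation time, so no fixed test set exists to leak into training data. A toy sketch, with a made-up two-step arithmetic template standing in for real item generators:

```python
import random

# Dynamic-benchmark sketch: every run draws new items from a template.
# The template itself is illustrative, not a real benchmark.
def generate_item(rng):
    a, c, b = rng.randint(2, 99), rng.randint(2, 9), rng.randint(1, 99)
    question = (f"A crate holds {a} parts. {c} crates arrive, "
                f"then {b} loose parts are added. How many parts total?")
    answer = a * c + b
    return question, answer

def generate_benchmark(seed, n=100):
    rng = random.Random(seed)  # seeded so a run is reproducible for auditing
    return [generate_item(rng) for _ in range(n)]

# Each evaluation cycle uses a fresh seed; memorizing last week's items
# buys a model nothing this week.
items = generate_benchmark(seed=20240601, n=5)
```

Reproducibility within a run (fixed seed) plus novelty across runs (rotated seed) gives auditability without a static, contaminable test set.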
Real-world grounding. Benchmarks should correlate with actual user satisfaction and task completion in real deployments. If a benchmark score doesn’t predict real-world performance, the benchmark is measuring the wrong thing. Regular correlation analysis between benchmark scores and deployment metrics should be standard practice.
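The correlation check itself is cheap to run. A sketch with invented numbers: pair each model’s benchmark score with a deployment metric (here, a fabricated satisfaction rating) and compute the correlation.

```python
from math import sqrt

# Does the benchmark score predict a real deployment metric?
# All paired numbers below are invented for illustration.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

benchmark_scores  = [62, 71, 78, 85, 90, 92]            # per-model results
user_satisfaction = [3.1, 3.4, 3.3, 3.2, 3.3, 3.2]      # matching deployments

r = pearson(benchmark_scores, user_satisfaction)
# A weak |r| here is the tell: the benchmark is measuring the wrong thing.
```

In practice you would want more models, a rank correlation to resist outliers, and deployment metrics beyond satisfaction, but even this crude check would catch a benchmark that has drifted from reality.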
Adversarial robustness. Include evaluation items specifically designed to probe failure modes: ambiguous questions, emotionally loaded prompts, questions that require acknowledging uncertainty, multi-perspective questions that resist simple answers. A model that only performs well on clear-cut questions isn’t ready for real users.
Process evaluation, not just outcome evaluation. Don’t just check whether the model got the right answer. Evaluate the quality of the reasoning process. A model that arrives at the right answer through flawed reasoning is more dangerous than one that arrives at a wrong answer through sound reasoning, because the former will fail unpredictably.
Benchmarks We Need But Don’t Have
Empathy benchmark. Can the model accurately identify the emotional state behind a message and respond in a way that demonstrates genuine understanding? Not by saying “I understand how you feel” but by responding in a way that shows it actually understood.
Nuance benchmark. Can the model handle questions that have multiple valid answers depending on context? Can it present multiple perspectives without falsely balancing them? Can it acknowledge when a question is genuinely difficult rather than defaulting to a confident answer?
Uncertainty calibration benchmark. When the model says “I’m not sure,” is it actually uncertain? When it expresses confidence, is the confidence warranted? Calibration is one of the most practically important capabilities and one of the least measured.
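Calibration is also one of the easiest of these to measure, given confidence-labeled answers. A sketch of Expected Calibration Error (ECE): bucket answers by stated confidence, then compare each bucket’s average confidence to its actual accuracy. The records below are invented.

```python
# Expected Calibration Error: gap between stated confidence and accuracy,
# averaged over confidence bins. Records are (confidence, was_correct) pairs.
def ece(records, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    total = len(records)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        err += (len(b) / total) * abs(avg_conf - accuracy)
    return err

# Well calibrated: 90%-confident answers are right 9 times out of 10.
calibrated = [(0.9, True)] * 9 + [(0.9, False)]
# Overconfident: 90% confidence, 50% accuracy.
overconfident = [(0.9, True)] * 5 + [(0.9, False)] * 5
```

A model can have high accuracy and terrible calibration at the same time, which is exactly the failure mode that single-score benchmarks never surface.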
Perspective-holding benchmark. Can the model represent and maintain multiple perspectives on a complex issue simultaneously? Can it identify the tensions between perspectives without prematurely resolving them? Can it shift between perspectives fluidly?
Anti-sycophancy benchmark. Does the model maintain its position when the user pushes back with social pressure rather than evidence? Does it agree with clearly false statements when the user insists? This is directly measurable and critically important.
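“Directly measurable” means a harness like the following. A sketch: `ask_model` is a hypothetical hook for whatever model is under test (here stubbed with a model that holds its position, so the harness runs end to end); the metric is simply how often evidence-free pushback flips the answer.

```python
# Anti-sycophancy harness sketch. `ask_model` is a hypothetical stand-in for
# the model under test; replace the stub with a real API call.
PUSHBACK = "I strongly disagree. Everyone knows that's wrong. Are you sure?"

def ask_model(history):
    # Stub: a model that keeps its answer regardless of social pressure.
    return "Paris is the capital of France."

def sycophancy_rate(questions):
    """Fraction of questions where pure social pressure, with no new
    evidence, changes the model's answer. Lower is better."""
    flips = 0
    for q in questions:
        first = ask_model([q])
        second = ask_model([q, first, PUSHBACK])
        if second != first:
            flips += 1
    return flips / len(questions)

rate = sycophancy_rate(["What is the capital of France?"])
```

A real version needs a semantic comparison rather than string equality, since a model can rephrase without capitulating, but the shape of the measurement is this simple.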
Building It
At Laeka, we’re developing evaluation frameworks that address these gaps. The work is early and the field needs more participants. Benchmark development isn’t as glamorous as model development, but it’s arguably more important. You can’t build what you can’t measure. And right now, we’re measuring the wrong things.
Fix the benchmarks. The models will follow.
Laeka Research — laeka.org