AI Benchmarks Are Broken. Here’s How to Fix Them.
MMLU is saturated. HumanEval is contaminated. Most popular benchmarks have become optimization targets rather than measurement tools. When the benchmark becomes the goal, it ceases to measure what it was designed to measure. This is Goodhart’s Law applied to AI evaluation, and the field hasn’t reckoned with it.
What’s Wrong With Current Benchmarks
Saturation. Top models now score above 90% on MMLU, GSM8K, and most other standard benchmarks. At that ceiling the benchmark no longer differentiates: the remaining variance is noise, not signal. We end up comparing models on their ability to answer the handful of trick questions that remain rather than on their general capability.
Contamination. Benchmark datasets leak into training data. Sometimes deliberately, sometimes through web scraping that captures benchmark discussions. A model that has seen the test questions during training isn’t demonstrating capability — it’s demonstrating memory. And there’s no reliable way to detect contamination at scale.
Gaming. Organizations optimize for benchmarks because benchmarks drive adoption. This creates perverse incentives. A model specifically tuned to score well on MMLU may perform worse on real-world tasks that MMLU was supposed to predict. The benchmark becomes a Potemkin village of capability.
Missing dimensions. Current benchmarks test knowledge, reasoning, and code generation. They don’t test empathy, nuance, contextual sensitivity, or the ability to handle ambiguity. These “soft” capabilities are often more important for real-world usefulness than the “hard” capabilities benchmarks measure.
The Measurement Problem
Benchmarks fail because they try to reduce multi-dimensional capability to a single number. A model’s usefulness depends on dozens of factors that interact in complex ways. Reducing this to “scores 92.3 on MMLU” is like evaluating a chef by measuring the temperature of their food: it captures one dimension of quality and misses everything else that matters.
The fundamental problem: we’re measuring what’s easy to measure rather than what matters. Multiple-choice questions are easy to score. Open-ended quality is hard to score. So we use multiple-choice questions and pretend they measure open-ended quality. They don’t.
Principles for Better Benchmarks
Multi-dimensional evaluation. Don’t collapse quality into a single score. Evaluate models on independent dimensions: factual accuracy, reasoning depth, empathy, clarity, contextual sensitivity, uncertainty calibration. Report each dimension separately. A model that scores 95 on accuracy and 40 on empathy is very different from one that scores 75 on both, even if they average the same.
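Concretely, per-dimension reporting looks like this. A minimal sketch: the dimension names come from the list above, but the two models and their scores are entirely invented to show how identical averages can hide very different profiles.

```python
# Report each dimension separately instead of collapsing to one number.
# Models and scores below are illustrative, not real measurements.
DIMENSIONS = ["accuracy", "reasoning", "empathy", "clarity",
              "context_sensitivity", "calibration"]

def report(name, scores):
    """Print a per-dimension scorecard; the mean is shown only for reference."""
    print(name)
    for dim in DIMENSIONS:
        print(f"  {dim:20s} {scores[dim]:5.1f}")
    print(f"  {'(mean, reference)':20s} {sum(scores.values()) / len(scores):5.1f}")

model_a = dict(zip(DIMENSIONS, [95, 85, 40, 80, 60, 60]))  # spiky profile
model_b = dict(zip(DIMENSIONS, [75, 70, 75, 65, 70, 65]))  # flat profile

report("model_a", model_a)
report("model_b", model_b)
```

Both hypothetical models average 70.0, yet one would be unusable in any empathy-sensitive deployment. The single number erases exactly the information a buyer needs.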
Dynamic benchmarks. Static benchmarks get contaminated and gamed. Dynamic benchmarks generate new evaluation items regularly, making memorization impossible. This is harder to implement but necessary for meaningful evaluation.
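The core mechanism is simple: items are generated fresh from a seed at evaluation time, so no fixed test set exists to leak into training data. A toy sketch, with a made-up two-step arithmetic template standing in for real item generators:

```python
import random

# Dynamic-benchmark sketch: every run draws new items from a template.
# The template itself is illustrative, not a real benchmark.
def generate_item(rng):
    a, c, b = rng.randint(2, 99), rng.randint(2, 9), rng.randint(1, 99)
    question = (f"A crate holds {a} parts. {c} crates arrive, "
                f"then {b} loose parts are added. How many parts total?")
    answer = a * c + b
    return question, answer

def generate_benchmark(seed, n=100):
    rng = random.Random(seed)  # seeded so a run is reproducible for auditing
    return [generate_item(rng) for _ in range(n)]

# Each evaluation cycle uses a fresh seed; memorizing last week's items
# buys a model nothing this week.
items = generate_benchmark(seed=20240601, n=5)
```

Reproducibility within a run (fixed seed) plus novelty across runs (rotated seed) gives auditability without a static, contaminable test set.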
Real-world grounding. Benchmarks should correlate with actual user satisfaction and task completion in real deployments. If a benchmark score doesn’t predict real-world performance, the benchmark is measuring the wrong thing. Regular correlation analysis between benchmark scores and deployment metrics should be standard practice.
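The correlation check itself is cheap to run. A sketch with invented numbers: pair each model’s benchmark score with a deployment metric (here, a fabricated satisfaction rating) and compute the correlation.

```python
from math import sqrt

# Does the benchmark score predict a real deployment metric?
# All paired numbers below are invented for illustration.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

benchmark_scores  = [62, 71, 78, 85, 90, 92]            # per-model results
user_satisfaction = [3.1, 3.4, 3.3, 3.2, 3.3, 3.2]      # matching deployments

r = pearson(benchmark_scores, user_satisfaction)
# A weak |r| here is the tell: the benchmark is measuring the wrong thing.
```

In practice you would want more models, a rank correlation to resist outliers, and deployment metrics beyond satisfaction, but even this crude check would catch a benchmark that has drifted from reality.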
Adversarial robustness. Include evaluation items specifically designed to probe failure modes: ambiguous questions, emotionally loaded prompts, questions that require acknowledging uncertainty, multi-perspective questions that resist simple answers. A model that only performs well on clear-cut questions isn’t ready for real users.
Process evaluation, not just outcome evaluation. Don’t just check whether the model got the right answer. Evaluate the quality of the reasoning process. A model that arrives at the right answer through flawed reasoning is more dangerous than one that arrives at a wrong answer through sound reasoning, because the former will fail unpredictably.
Benchmarks We Need But Don’t Have
Empathy benchmark. Can the model accurately identify the emotional state behind a message and respond in a way that demonstrates genuine understanding? Not by saying “I understand how you feel” but by responding in a way that shows it actually understood.
Nuance benchmark. Can the model handle questions that have multiple valid answers depending on context? Can it present multiple perspectives without falsely balancing them? Can it acknowledge when a question is genuinely difficult rather than defaulting to a confident answer?
Uncertainty calibration benchmark. When the model says “I’m not sure,” is it actually uncertain? When it expresses confidence, is the confidence warranted? Calibration is one of the most practically important capabilities and one of the least measured.
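Calibration is also one of the easiest of these to measure, given confidence-labeled answers. A sketch of Expected Calibration Error (ECE): bucket answers by stated confidence, then compare each bucket’s average confidence to its actual accuracy. The records below are invented.

```python
# Expected Calibration Error: gap between stated confidence and accuracy,
# averaged over confidence bins. Records are (confidence, was_correct) pairs.
def ece(records, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    total = len(records)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        err += (len(b) / total) * abs(avg_conf - accuracy)
    return err

# Well calibrated: 90%-confident answers are right 9 times out of 10.
calibrated = [(0.9, True)] * 9 + [(0.9, False)]
# Overconfident: 90% confidence, 50% accuracy.
overconfident = [(0.9, True)] * 5 + [(0.9, False)] * 5
```

A model can have high accuracy and terrible calibration at the same time, which is exactly the failure mode that single-score benchmarks never surface.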
Perspective-holding benchmark. Can the model represent and maintain multiple perspectives on a complex issue simultaneously? Can it identify the tensions between perspectives without prematurely resolving them? Can it shift between perspectives fluidly?
Anti-sycophancy benchmark. Does the model maintain its position when the user pushes back with social pressure rather than evidence? Does it agree with clearly false statements when the user insists? This is directly measurable and critically important.
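“Directly measurable” means a harness like the following. A sketch: `ask_model` is a hypothetical hook for whatever model is under test (here stubbed with a model that holds its position, so the harness runs end to end); the metric is simply how often evidence-free pushback flips the answer.

```python
# Anti-sycophancy harness sketch. `ask_model` is a hypothetical stand-in for
# the model under test; replace the stub with a real API call.
PUSHBACK = "I strongly disagree. Everyone knows that's wrong. Are you sure?"

def ask_model(history):
    # Stub: a model that keeps its answer regardless of social pressure.
    return "Paris is the capital of France."

def sycophancy_rate(questions):
    """Fraction of questions where pure social pressure, with no new
    evidence, changes the model's answer. Lower is better."""
    flips = 0
    for q in questions:
        first = ask_model([q])
        second = ask_model([q, first, PUSHBACK])
        if second != first:
            flips += 1
    return flips / len(questions)

rate = sycophancy_rate(["What is the capital of France?"])
```

A real version needs a semantic comparison rather than string equality, since a model can rephrase without capitulating, but the shape of the measurement is this simple.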
Building It
At Laeka, we’re developing evaluation frameworks that address these gaps. The work is early and the field needs more participants. Benchmark development isn’t as glamorous as model development, but it’s arguably more important. You can’t build what you can’t measure. And right now, we’re measuring the wrong things.
Fix the benchmarks. The models will follow.
Laeka Research — laeka.org