The Expert Problem: Why PhDs Make Worse Annotators Than Practitioners
You’d think experts would make the best annotators. They understand the domain deeply. They can evaluate quality with precision. They know what good looks like. But in practice, domain experts consistently produce worse training data than skilled practitioners. The reasons reveal something fundamental about the gap between knowledge and wisdom.
The Curse of Knowledge
Experts know too much to evaluate responses the way real users would. When a PhD in psychology evaluates an AI’s response to someone describing anxiety, they assess it against clinical standards. They penalize responses that don’t use precise diagnostic language. They reward technical accuracy.
But the person experiencing anxiety doesn’t want a clinical assessment. They want to feel heard. They want practical help. They want a response that meets them where they are, not where the DSM-5 says they should be.
Expert annotations train models to sound like experts. That’s useful when the user is also an expert. For everyone else, it produces responses that are technically correct and humanly useless.
The Practitioner Advantage
Practitioners — therapists, coaches, teachers, customer service professionals, contemplative practitioners — interact with real people daily. They know what works in practice, not just what’s correct in theory. Their annotations reflect real-world effectiveness rather than academic standards.
A practicing therapist knows that sometimes the most helpful response isn’t the most accurate one. It’s the one that opens a door. The one that creates safety. The one that invites the person to explore further. These qualities are invisible to expert evaluation frameworks but essential to actual helpfulness.
Practitioners also understand failure modes from experience. They know which responses shut people down, which create defensiveness, and which feel patronizing. This knowledge can’t be formalized in annotation guidelines. It lives in the practitioner’s embodied understanding of human interaction.
What Goes Wrong With Expert Annotation
Over-specificity. Experts penalize responses for being imprecise about details that users don’t care about. A medical expert might reject a response because it says “blood pressure medication” instead of “angiotensin-converting enzyme inhibitor.” The user wanted to understand their prescription, not pass a pharmacology exam.
Jargon bias. Experts reward responses that use domain terminology. They’ve spent years learning this vocabulary and associate it with competence. But for most users, jargon is a barrier. The best response is one that communicates clearly in the user’s language, not the expert’s.
Missing the relational dimension. Experts evaluate content. Practitioners evaluate the whole interaction. Content accuracy is necessary but not sufficient. The relational quality — whether the response creates connection or distance — is often more important for user satisfaction than technical precision.
Anchoring to ideal responses. Experts compare AI responses to what they would say. But expert responses aren’t always the best responses for a general audience. A response calibrated for a colleague is different from a response calibrated for a first-time user. Experts have trouble making this distinction because expertise is their default mode.
The Evidence From Laeka
We’ve run controlled experiments comparing expert and practitioner annotations on the same datasets. The results are consistent.
Models trained on expert-annotated DPO pairs score higher on domain-specific benchmarks. Models trained on practitioner-annotated DPO pairs score higher on user satisfaction, helpfulness ratings, and real-world task completion. The gap is substantial: typically 15-25% on user-facing metrics.
The most interesting finding: models trained on practitioner annotations also perform reasonably well on domain benchmarks — not as high as expert-trained models, but within an acceptable range. Models trained on expert annotations perform poorly on user satisfaction. Practitioners produce data that’s good enough technically and excellent relationally. Experts produce data that’s excellent technically and poor relationally.
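To make concrete what “expert-annotated” versus “practitioner-annotated” means at the data level, here is a minimal sketch in Python. The DPOPair class and its field names are illustrative assumptions, not Laeka’s internal schema; the point is simply that the same pair of candidate responses can yield opposite preference labels depending on who annotates them.

```python
# Illustrative sketch (not Laeka's tooling): the same candidate responses
# produce different DPO preference pairs depending on the annotator.
from dataclasses import dataclass

@dataclass
class DPOPair:
    prompt: str
    chosen: str      # response the annotator preferred
    rejected: str    # response the annotator dispreferred
    annotator: str   # "expert" or "practitioner"

prompt = "I was prescribed lisinopril. What does it do?"
response_a = ("Lisinopril is an angiotensin-converting enzyme (ACE) inhibitor "
              "that reduces vasoconstriction, lowering arterial pressure.")
response_b = ("It's a blood pressure medication. It relaxes your blood vessels "
              "so your heart doesn't have to work as hard.")

# The expert tends to prefer the technically precise response;
# the practitioner tends to prefer the one the user can actually use.
expert_pair = DPOPair(prompt, chosen=response_a, rejected=response_b, annotator="expert")
practitioner_pair = DPOPair(prompt, chosen=response_b, rejected=response_a, annotator="practitioner")
```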
The Optimal Annotation Team
The solution isn’t to exclude experts entirely. It’s to compose annotation teams deliberately. Practitioners should form the majority of the team, providing the baseline of practical, user-facing quality. Experts should provide quality checks on technical accuracy, catching factual errors that practitioners might miss.
The ratio that works best in our experience: roughly 70% practitioners, 30% experts. Practitioners annotate first. Experts review for accuracy. Disagreements are resolved in favor of the practitioner’s judgment on relational quality and the expert’s judgment on factual accuracy.
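A minimal sketch of that review flow, with hypothetical labels and a hypothetical resolve function rather than Laeka’s actual tooling: the practitioner’s preference stands unless the expert flags a factual error in the preferred response, in which case the preference flips.

```python
# Hypothetical sketch of the composite review flow: practitioners label
# first, an expert then checks the chosen response for factual accuracy.
# Relational disagreements defer to the practitioner, factual ones to the expert.

def resolve(practitioner_choice: str, expert_factual_errors: list[str]) -> str:
    """Return the final preference label ("a" or "b") for one DPO pair."""
    if expert_factual_errors:
        # The expert found factual problems in the practitioner's preferred
        # response, so the preference flips to the other candidate.
        return "b" if practitioner_choice == "a" else "a"
    # No factual issues: the practitioner's relational judgment stands.
    return practitioner_choice

# Usage
print(resolve("b", []))                          # -> "b" (practitioner's choice stands)
print(resolve("b", ["wrong dosage guidance"]))   # -> "a" (expert override on facts)
```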
This composite approach produces training data that’s both technically sound and humanly useful. Neither group alone achieves this balance.
The Contemplative Practitioner as Annotator
Contemplative practitioners occupy a unique position in this framework. They combine the deep domain knowledge of experts (in the domain of human cognition and emotion) with the practical engagement of practitioners (through their daily practice and often through their work as teachers, therapists, or counselors).
Their annotations capture qualities that neither pure experts nor pure practitioners notice: the subtle emotional dynamics of an interaction, the meta-cognitive qualities of a response, the degree to which a response invites growth rather than dependency.
The best training data comes from people who know their domain deeply AND engage with real humans daily AND have trained their own attention and awareness systematically. That intersection is rare. But it’s where the highest-quality annotations live.
Laeka Research — laeka.org