Synthetic Data: Can AI Train AI? The Evidence Says Mostly No.

The pitch is seductive. Running out of training data? Just have AI generate more. Use your existing model to create synthetic datasets, then train the next model on those. Problem solved.

Except the evidence says this mostly doesn’t work. Not in the way proponents claim. And the reasons why are more interesting than the failure itself.

The Promise

Synthetic data generation promises to solve the training data bottleneck. If you can use a model to generate unlimited high-quality training examples, you’ve effectively created infinite data. You can train bigger models, cover more domains, and do it all without the legal headaches of scraping the internet.

The idea isn’t new. GANs have been generating synthetic images for years. Simulated environments have trained robotics models successfully. In these narrow domains, synthetic data works remarkably well.

The question is whether it works for the much harder problem of training general-purpose language models. The answer, increasingly, is no.

The Model Collapse Problem

In 2023, researchers at Oxford and Cambridge published a landmark paper on model collapse. The finding was stark: models trained on synthetic data from previous model generations progressively degrade. Each generation loses the tails of the distribution — the rare, unusual, creative outputs that make language models useful.

The mechanism is intuitive once you see it. A model generates text that reflects the most probable outputs given its training. It’s biased toward the average, the expected, the conventional. When you train a new model on that output, you’re training on a smoothed, averaged version of reality. Do this for several generations and you get text that’s grammatically perfect and substantively empty.
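The smoothing dynamic can be seen in a toy numerical sketch (not the paper's actual experimental setup): start from a long-tailed distribution, and at each "generation" keep only the most probable outcomes covering 90% of the mass, mimicking a model that over-samples its head. The support shrinks generation after generation.

```python
import numpy as np

def next_generation(probs, top_p=0.9):
    """Keep only the most probable outcomes covering top_p of the mass,
    then renormalize -- a toy stand-in for training on model samples,
    which over-weights the head of the distribution."""
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = order[:cutoff]
    new = np.zeros_like(probs)
    new[kept] = probs[kept]
    return new / new.sum()

rng = np.random.default_rng(0)
dist = rng.dirichlet(np.ones(1000))   # a "human" distribution with long tails
supports = []
for _ in range(5):
    dist = next_generation(dist)
    supports.append(int((dist > 0).sum()))
print(supports)   # number of surviving outcomes shrinks each generation
```

The tails never come back: once an outcome gets zero mass, no amount of later sampling can regenerate it.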

This isn’t a theoretical concern. Labs that have experimented with large-scale synthetic data report the same pattern. The models get more fluent but less interesting. More consistent but less capable of handling edge cases. More predictable but less useful.

Where Synthetic Data Actually Works

Synthetic data isn’t useless. It works well in specific, constrained scenarios.

Math and code. You can generate mathematical problems with verified solutions. You can generate code with test cases that prove correctness. In domains where you can formally verify the output, synthetic data is powerful because quality is objective and measurable.

Data augmentation. Using synthetic data to supplement real data, not replace it, can improve performance. A training set that’s 90% real and 10% synthetic often outperforms 100% real, because the synthetic data fills gaps in coverage.

Structured tasks. Classification, entity extraction, format conversion — tasks with clear right answers benefit from synthetic examples. You can generate thousands of labeled examples for a specific task far faster than human annotators could produce them.

The pattern is clear: synthetic data works when you can verify quality algorithmically. It fails when quality is subjective, nuanced, or requires human judgment to assess.
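What "verify quality algorithmically" means in practice can be sketched in a few lines (a minimal illustration, not any lab's actual pipeline): generate arithmetic problems whose answers are computed rather than sampled, so every label is correct by construction.

```python
import random

def generate_verified(n, seed=0):
    """Synthetic arithmetic examples whose labels are computed, not
    sampled: every answer is correct by construction, so quality is
    objective and checkable."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        a, b = rng.randint(2, 99), rng.randint(2, 99)
        out.append({"prompt": f"What is {a} * {b}?", "answer": str(a * b)})
    return out

examples = generate_verified(3)
for ex in examples:
    print(ex["prompt"], "->", ex["answer"])
```

The same idea scales to code (run the test suite) and formal math (check the proof). The verification step is what makes the data trustworthy, not the generator.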

The Diversity Problem

The deepest issue with synthetic data isn’t quality — it’s diversity. Human-generated text reflects the full range of human experience, perspective, and creativity. It contains surprises, contradictions, novel framings, and genuine insight.

AI-generated text contains none of this. It reflects the training distribution, smoothed and averaged. Even with temperature sampling and other techniques to increase variety, the output stays within the bounds of what the model has already learned. It can recombine. It can’t genuinely create.
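Temperature's limitation is easy to see in the math. A higher temperature flattens the softmax over the model's logits, but it only redistributes mass among outcomes the model already scores — nothing outside the learned support ever gains probability. A minimal sketch:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Higher temperature flattens the distribution, but mass can only
    shift among the outcomes already present -- nothing new enters
    the support."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                 # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 2.0, 0.0, -6.0]
cold = softmax_with_temperature(logits, 0.5)   # sharper, more conventional
hot = softmax_with_temperature(logits, 2.0)    # flatter, more "creative"
print(cold.round(3))
print(hot.round(3))
```

Both settings rank the same outcomes in the same order; the knob changes how adventurous the sampling is within the model's existing world, not the size of that world.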

This matters because the value of training data isn’t just information — it’s the distribution of information. A model needs to see rare events to handle rare events. It needs to encounter unusual perspectives to understand them. Synthetic data, by construction, underrepresents everything unusual.

The Feedback Loop

There’s an even more concerning dynamic. As AI-generated content increases on the internet, future training datasets will inevitably contain more synthetic data, even when labs try to filter it out. This creates a feedback loop where models are partially trained on the output of previous models, generation after generation.

The long-term consequences of this feedback loop are unclear, but early evidence suggests they’re negative. Models become more homogeneous over time. Writing styles converge. Perspectives narrow. The internet starts to sound like it was all written by the same entity — because increasingly, it was.

This is a collective action problem. Each individual lab using synthetic data might see acceptable results. But the cumulative effect across the industry degrades the entire ecosystem.

What the Industry Is Doing

Smart labs are moving away from naive synthetic data approaches and toward more sophisticated strategies.

Constitutional AI approaches use synthetic data not for general training but for specific alignment objectives. The synthetic data isn’t pretending to be human — it’s providing targeted examples of desired behavior.

Distillation uses a larger model to generate training data for a smaller model. This works because you’re not trying to exceed the teacher’s capability — you’re trying to compress it. The information loss is acceptable because the goal is efficiency, not improvement.

Hybrid approaches carefully mix synthetic and real data with strict quality controls. The synthetic data is used to fill specific gaps, not to replace human-generated content wholesale.
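A hybrid mix like this can be sketched in a few lines (an illustrative toy, not a recommendation; the 10% synthetic slice echoes the augmentation ratio mentioned earlier, and the tuple tags are just placeholders for real examples):

```python
import random

def mix_datasets(real, synthetic, synth_frac=0.10, seed=0):
    """Build a mostly-real training set with a small synthetic slice.
    synth_frac is the fraction of the *final* mix that is synthetic."""
    rng = random.Random(seed)
    n_synth = round(len(real) * synth_frac / (1.0 - synth_frac))
    chosen = rng.sample(synthetic, min(n_synth, len(synthetic)))
    mixed = list(real) + chosen
    rng.shuffle(mixed)
    return mixed

real = [("real", i) for i in range(90)]
synthetic = [("synthetic", i) for i in range(50)]
mixed = mix_datasets(real, synthetic)
print(len(mixed), sum(1 for tag, _ in mixed if tag == "synthetic"))
```

The point of keeping the ratio explicit and low is exactly the one above: the synthetic slice fills gaps, while the real data continues to anchor the distribution.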

The Contemplative View

From a contemplative research perspective, the synthetic data debate reveals a deeper confusion about what training data actually is.

Data isn’t just tokens. It’s crystallized experience. When a human writes a paragraph about grief, that paragraph carries the weight of lived experience. When a model generates a paragraph about grief, it carries the weight of statistical patterns. The tokens might look identical. The information content is fundamentally different.

This distinction matters for alignment. If we want AI systems that understand human values, we need training data that embodies human values. Not synthetic approximations of human values generated by a system that has never valued anything.

At Laeka Research, we think the synthetic data question ultimately points to a harder question: what is the relationship between data and understanding? Can understanding emerge from data that was itself generated without understanding?

The evidence says mostly no. And that’s worth taking seriously.
