{"id":276,"date":"2026-03-21T18:37:12","date_gmt":"2026-03-21T18:37:12","guid":{"rendered":"https:\/\/lab.laeka.org\/?p=276"},"modified":"2026-03-21T18:37:12","modified_gmt":"2026-03-21T18:37:12","slug":"the-chinchilla-scaling-laws-are-wrong-heres-what-replaced-them","status":"publish","type":"post","link":"https:\/\/laeka.org\/publications\/the-chinchilla-scaling-laws-are-wrong-heres-what-replaced-them\/","title":{"rendered":"The Chinchilla Scaling Laws Are Wrong. Here&#8217;s What Replaced Them."},"content":{"rendered":"<p>In 2022, DeepMind&#8217;s Chinchilla paper reshaped the AI industry. The claim: for compute-optimal training, scale parameters and data tokens equally. A 70B model needs ~1.4T tokens. The industry rearranged itself around this law. Then Llama proved it wrong.<\/p>\n<h2>What Chinchilla Actually Said<\/h2>\n<p>The Chinchilla scaling laws established a ratio: for a given compute budget, the optimal allocation between model parameters and training tokens follows a roughly 1:20 ratio. A 10B model should see ~200B tokens. A 70B model should see ~1.4T tokens. Spending compute on more parameters without proportionally more data, or vice versa, wastes resources.<\/p>\n<p>This was a correction to GPT-3&#8217;s approach, which was massively over-parameterized relative to its training data. Chinchilla 70B, trained on the &#8220;right&#8221; amount of data, matched GPT-3 175B&#8217;s performance with less than half the parameters. The implication was clear: the industry had been building models that were too big and training them on too little data.<\/p>\n<p>Labs took notice. Training runs were redesigned around the Chinchilla ratio. The &#8220;compute-optimal&#8221; framing became gospel.<\/p>\n<h2>Where Chinchilla Goes Wrong<\/h2>\n<p>Chinchilla optimizes for <strong>training compute<\/strong>, not <strong>total lifecycle cost<\/strong>. This is a critical distinction. Training a model happens once. Running inference happens millions of times. A smaller model trained on more data costs more to train but costs dramatically less to deploy.<\/p>\n<p>Llama demonstrated this brilliantly. Llama 1 7B was trained on 1T tokens \u2014 roughly 7x the Chinchilla-optimal amount. Llama 2 7B saw 2T tokens. Llama 3.1 8B consumed 15T tokens. Each version was &#8220;over-trained&#8221; by Chinchilla standards, yet each was better than the last.<\/p>\n<p>The reason: when you care about inference cost, you want the smallest model that hits your quality target. Over-training a small model beyond the Chinchilla ratio produces a model that&#8217;s cheaper to run but nearly as good as a larger, Chinchilla-optimal model. The extra training compute is a one-time cost that pays dividends every time the model serves a request.<\/p>\n<h2>The Inference-Optimal Scaling Laws<\/h2>\n<p>Researchers at institutions including Meta, Hugging Face, and several universities have developed revised scaling laws that account for inference cost. The framework is called <strong>inference-optimal scaling<\/strong> or sometimes &#8220;deployment-aware scaling.&#8221;<\/p>\n<p>The insight: given a fixed inference budget (cost per token in production), the optimal training strategy is to train a smaller model on significantly more data than Chinchilla recommends. How much more depends on your expected inference volume.<\/p>\n<p>For a model that will serve billions of requests, the optimal training-to-parameter ratio might be 100:1 or even 200:1 \u2014 10x the Chinchilla recommendation. The extra training cost is amortized across so many inference calls that it becomes negligible.<\/p>\n<p>This explains the industry trend toward smaller, heavily trained models. It&#8217;s not that labs forgot Chinchilla. They&#8217;re optimizing for a different objective: <strong>minimum total cost of ownership<\/strong> rather than minimum training cost.<\/p>\n<h2>Data Quality Changes the Equation<\/h2>\n<p>The other factor Chinchilla didn&#8217;t account for is <strong>data quality variation<\/strong>. The original scaling laws assumed roughly uniform data quality. In practice, the first trillion tokens of high-quality data teach more per token than the second trillion, which teaches more than the third.<\/p>\n<p>This means the scaling curves aren&#8217;t smooth power laws \u2014 they have inflection points where adding more data of declining quality yields diminishing returns faster than the theory predicts. The Phi models proved that a small model on high-quality data can match a larger model on lower-quality data, breaking the Chinchilla relationship entirely.<\/p>\n<p>Modern scaling research treats data quality as a variable in the scaling equations rather than a constant. The resulting predictions are more complex but more accurate: the optimal model size and data quantity depend on the quality distribution of available training data, not just the total compute budget.<\/p>\n<h2>What This Means in Practice<\/h2>\n<p>For organizations training models, the practical implications are clear. <strong>Don&#8217;t follow Chinchilla ratios blindly.<\/strong> Instead, consider your deployment scenario:<\/p>\n<p>If you&#8217;re training a model for a specific, high-volume production use case, train a smaller model on much more data than Chinchilla suggests. The inference savings will far exceed the extra training cost.<\/p>\n<p>If you&#8217;re training a research model that will be evaluated on benchmarks and then mostly shelved, Chinchilla ratios are fine. Training compute dominates when inference volume is low.<\/p>\n<p>If you&#8217;re working with limited, high-quality data and can&#8217;t easily get more, a larger model trained on less data may be optimal. The model needs enough parameters to absorb the knowledge in your data, and there&#8217;s a minimum dataset size below which smaller models waste capacity.<\/p>\n<h2>The Bigger Picture<\/h2>\n<p>Scaling laws aren&#8217;t physical constants. They&#8217;re empirical observations that depend on assumptions about architecture, data, hardware, and optimization \u2014 all of which change over time. Chinchilla was right for its context (training compute optimization in 2022). It&#8217;s wrong for today&#8217;s context (total cost optimization with inference-dominant workloads).<\/p>\n<p>The lesson isn&#8217;t that scaling laws are useless. It&#8217;s that you need to understand <strong>which variable they&#8217;re optimizing<\/strong> and whether that variable matches your objective. Blindly applying someone else&#8217;s scaling law to your problem is a fast path to suboptimal results.<\/p>\n<p>For ongoing research on training efficiency and model scaling, visit <a href='https:\/\/lab.laeka.org'>Laeka Research<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In 2022, DeepMind&#8217;s Chinchilla paper reshaped the AI industry. The claim: for compute-optimal training, scale parameters and data tokens equally. A 70B model needs ~1.4T tokens. The industry rearranged itself around this law. Then&#8230;<\/p>\n","protected":false},"author":1,"featured_media":275,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","footnotes":""},"categories":[243],"tags":[],"class_list":["post-276","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-architecture"],"_links":{"self":[{"href":"https:\/\/laeka.org\/publications\/wp-json\/wp\/v2\/posts\/276","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/laeka.org\/publications\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/laeka.org\/publications\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/laeka.org\/publications\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/laeka.org\/publications\/wp-json\/wp\/v2\/comments?post=276"}],"version-history":[{"count":1,"href":"https:\/\/laeka.org\/publications\/wp-json\/wp\/v2\/posts\/276\/revisions"}],"predecessor-version":[{"id":432,"href":"https:\/\/laeka.org\/publications\/wp-json\/wp\/v2\/posts\/276\/revisions\/432"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/laeka.org\/publications\/wp-json\/wp\/v2\/media\/275"}],"wp:attachment":[{"href":"https:\/\/laeka.org\/publications\/wp-json\/wp\/v2\/media?parent=276"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/laeka.org\/publications\/wp-json\/wp\/v2\/categories?post=276"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/laeka.org\/publications\/wp-json\/wp\/v2\/tags?post=276"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}