Forecasting How Good a Language Model Will Be

Building a language model is really a long series of comparative decisions. Which pretraining corpus is better? Does this new post-training recipe actually help on the target domain? Is this architecture worth the switch? Answering any of these well requires a forecast of the downstream performance for the tasks we care about. However, the two signals we commonly use for this are flawed:

Cross-entropy loss is wonderfully smooth: it scales predictably with compute and extrapolates with remarkable fidelity. But it is only weakly tied to the capabilities we care about. For instance, two models with nearly identical loss may behave very differently on a downstream task [1].

Direct downstream evaluation is expensive, sparse, and often uninformative (e.g., at the start of training). Frontier evaluations may require human experts, code execution, or an external environment that may be inaccessible. Prior work on downstream performance estimation has often relied on benchmark scores to fit scaling laws [2, 3].

In this work, we ask: is there a different signal, one that is as smooth as loss but as task-conditioned as evaluation?

The Idea

For the target task of interest, we assume access to expert solutions, i.e., worked-out reasoning traces written by a human or a strong model. We believe it will always be easier to obtain a handful of expert reasoning trajectories than to obtain access to the full downstream environment. Our hypothesis is that by observing a model's distribution over such expert solutions, we can infer its relative capability on the task. For instance, a model that cannot yet solve a hard math problem on its own may still assign high probability to some of the crucial steps once it sees them in context.

We pass an expert trajectory through a candidate model in a single forward pass (no generation, so it is extremely cheap) and compute token-level statistics of its next-token distribution: for example, the entropy at each step, whether the expert's token is in the model's top-k, the rank of the expert token, and so on. These are then aggregated with weights that emphasize important positions, such as low-frequency tokens or places where the model is most uncertain. The result is a proxy metric: a single number that is cheap to compute and tightly conditioned on the task.

Crucially, the expert only needs to provide text. We never need its log-probabilities and never touch the downstream evaluator, so the same recipe works with human-written solutions or with traces from closed-weight frontier models.

Schematic of the proxy metric method: an expert trajectory is read by the candidate model. At each token of the expert's solution, we read off a *core metric* (e.g., the rank of the expert token under the model's distribution) and a *weighting* term (e.g., entropy). Their scaled dot product is the proxy metric.
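The per-token computation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the exact recipe used in this work: the function names `token_stats` and `proxy_metric` are hypothetical, the next-token distributions are assumed to be already available as plain probability lists (in practice they would come from one forward pass of the candidate model), and normalizing by the sum of the weights is our assumed reading of "scaled".

```python
import math

def token_stats(probs, expert_id, k=5):
    """Statistics of one next-token distribution at one expert position."""
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Rank 1 means the expert token is the model's most likely token.
    rank = 1 + sum(1 for p in probs if p > probs[expert_id])
    return {"entropy": entropy, "rank": rank, "in_top_k": float(rank <= k)}

def proxy_metric(dists, expert_ids, core="in_top_k", weight="entropy", k=5):
    """Weighted average of a core metric over all expert positions.

    Assumption: weights are normalized to sum to one, so the result is
    comparable across trajectories of different lengths.
    """
    stats = [token_stats(p, t, k) for p, t in zip(dists, expert_ids)]
    weights = [s[weight] for s in stats]
    total = sum(weights)
    if total == 0.0:
        # All weights vanish (e.g., every distribution is one-hot):
        # fall back to a uniform average.
        return sum(s[core] for s in stats) / len(stats)
    return sum(w * s[core] for w, s in zip(weights, stats)) / total

# Toy example: three positions over a three-token vocabulary.
toy_dists = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.4, 0.3, 0.3]]
toy_expert_ids = [0, 1, 1]  # token ids taken from the expert's text
print(proxy_metric(toy_dists, toy_expert_ids, k=1))
```

Note that only the expert's token ids enter the computation; no expert log-probabilities are needed. Swapping `core="rank"` or a different weighting key gives other members of the same family.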

Result 1 — Ranking models without running the benchmark

First, the practical question of model selection: given a pool of heterogeneous models, can we rank them on a downstream task without evaluating them on it? We fit a simple linear ranker over the proxy features on a set of held-in tasks and test it on held-out reasoning benchmarks. It recovers the true ranking with mean Spearman ρ = 0.81, compared with just 0.36 for cross-entropy loss.

Benchmark score against the proxy metric on six held-out reasoning benchmarks (GPQA, MMLU-Pro, USACO, SuperGPQA, AIME, HMMT). The proxy metric tracks true benchmark score closely, with tight monotonic relationships across very different tasks (mean cross-validated Spearman ρ = 0.81).
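For readers unfamiliar with the agreement measure reported here, Spearman ρ is the Pearson correlation between the ranks of two score lists. The sketch below is a generic stdlib implementation with average ranks for ties; the helper names and the toy numbers are ours and are not results from this work.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation of the rank vectors of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

# Made-up proxy scores and benchmark accuracies for four models.
predicted = [0.21, 0.35, 0.30, 0.48]
actual = [0.10, 0.22, 0.31, 0.55]
print(spearman_rho(predicted, actual))
```

Because only ranks matter, the proxy does not have to be calibrated to the benchmark's scale; it only has to order the models correctly.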

Result 2 — Choosing pretraining data with ~10,000× less compute

Next, a decision made before committing real compute: which pretraining corpus should you use? On the DataDecide testbed, the goal is to rank 25 candidate corpora the way a 1B target model would rank them by downstream accuracy. We train tiny proxy models (4M–90M parameters) and score them with our proxies. This recovers the corpus ranking about as well as direct downstream evaluation while using on the order of 10,000× less compute, pushing the cost–accuracy frontier beyond prior methods.

Decision accuracy (fraction of corpus pairs ranked correctly) versus proportion of target compute. The frequency-weighted top-5 accuracy proxy reaches decision accuracy above 0.85 at roughly 1e-5 of target compute, reaching the quality of direct downstream evaluation while sitting orders of magnitude further left on the compute axis.
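Decision accuracy, as defined in the caption, can be computed directly from two score tables. The following is a minimal sketch: the function name, the corpus labels, and the scores are made up for illustration, and counting a tie in either table as a disagreement with a strict preference is our assumption.

```python
from itertools import combinations

def decision_accuracy(proxy_scores, target_scores):
    """Fraction of corpus pairs that the proxy orders the same way as the target.

    Both arguments map a corpus name to a score where higher is better.
    """
    pairs = list(combinations(sorted(target_scores), 2))
    correct = 0
    for a, b in pairs:
        proxy_prefers_a = proxy_scores[a] > proxy_scores[b]
        target_prefers_a = target_scores[a] > target_scores[b]
        correct += proxy_prefers_a == target_prefers_a
    return correct / len(pairs)

# Hypothetical scores: small-model proxy values vs. large-model accuracy.
small_model_proxy = {"corpus_a": 0.30, "corpus_b": 0.50, "corpus_c": 0.40}
target_accuracy = {"corpus_a": 0.40, "corpus_b": 0.55, "corpus_c": 0.60}
print(decision_accuracy(small_model_proxy, target_accuracy))
```

With 25 candidate corpora there are 300 such pairs, so the metric is fairly fine-grained even though each individual decision is binary.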

Result 3 — Forecasting accuracy across training

Finally, forecasting through training. We show that we can find good proxy metrics that follow clean power laws along both pretraining and post-training trajectories. Additionally, when a handful of early benchmark scores are available, fitting downstream accuracy as a function of the proxy (rather than of loss or raw compute) extrapolates across an 18× compute horizon with roughly half the error of those alternatives.

Extrapolating HellaSwag accuracy along the OLMo-3-7B pretraining trajectory. The proxy power-law fit (RMSE = 0.003) tracks the target far more closely than the CE loss exponential (RMSE = 0.09).
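To make "follows a clean power law" concrete, here is a small sketch of fitting and extrapolating a curve of the form proxy = a · compute^b by least squares in log-log space. It is an illustration only: the functional form without an offset term, the function names, and the synthetic data points are all assumptions, and the second stage described above (mapping the proxy to downstream accuracy) is not shown.

```python
import math

def fit_power_law(compute, proxy):
    """Least-squares fit of proxy = a * compute**b in log-log space.

    Assumes all inputs are strictly positive.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(p) for p in proxy]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope_num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope_den = sum((x - mx) ** 2 for x in xs)
    b = slope_num / slope_den
    a = math.exp(my - b * mx)
    return a, b

def predict(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * compute ** b

# Synthetic early checkpoints generated from a known power law.
early_compute = [1e18, 2e18, 4e18, 8e18]
early_proxy = [3.0 * c ** -0.1 for c in early_compute]

a, b = fit_power_law(early_compute, early_proxy)
# Extrapolate 18x beyond the last observed checkpoint.
print(predict(a, b, 18 * early_compute[-1]))
```

On real trajectories the points will not lie exactly on a line in log-log space, so the residuals of this fit are a quick check on whether a given proxy is smooth enough to extrapolate.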

Why this matters

The thread running through all three settings is the same: generic CE loss is smooth but task-agnostic; direct evaluation is task-specific but expensive and often uninformative early on. Proxy metrics computed over expert trajectories can give us both smoothness and task-conditioning in a single, cheap forward pass. We believe that expert solutions are an underused source of evaluation signal. Studying how a model's predictions differ from an expert's reasoning, step by step, tells us a surprising amount about what that model is capable of.

References

[1] Liu et al. Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models. ICML 2023.
[2] Ruan et al. Observational Scaling Laws and the Predictability of Language Model Performance. NeurIPS 2024.
[3] Owen. How Predictable Is Language Model Benchmark Performance? arXiv 2024.