
The end of benchmarks

  • samrodriques
  • 16 minutes ago
  • 4 min read

Imagine a society of apes trying to play tic tac toe. One day, they meet a very good human player, who also carries a computer running a superintelligent AI trained to play tic tac toe. From the apes' perspective, the human and the computer perform equally well: they stalemate each other every time, and they often beat the apes. Even if the superintelligent AI is much better than the human at, say, Go, the apes could not tell.


This is an example of the problem of intelligence saturation. If the hardest task your society performs is tic tac toe, you cannot distinguish between two intelligences that both saturate performance on it. We are still far from intelligence saturation, but it is easy to imagine that, at some point, we will have models that saturate virtually all tasks humans perform today, at which point we will no longer be able to tell which model is more intelligent, at least on any tasks we care about. Over the past six months, I have come to believe that day is coming sooner than we think. The solution will be increasing amounts of human labor dedicated to evaluation on real-world tasks, which may soak up some of the slack in the labor market that results from productivity gains.


In the development of LLMs, we started out benchmarking them with simple multiple choice questions like MMLU. Those saturate quickly, so we have now transitioned to open-answer benchmarks that either have verifiable answers (think ARC-AGI) or that can be evaluated using rubrics. In many areas, though, the models already seem to be encountering the tic tac toe problem: the benchmarks are saturating, and it is becoming progressively harder to make new, unsaturated benchmarks.
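To make the distinction concrete, here is a minimal sketch of the two open-answer grading styles mentioned above. The function names, rubric criteria, and keyword-matching shortcut are all hypothetical illustrations; in practice, rubric criteria are judged by humans or by an LLM grader, not by substring checks.

```python
def verifiable_score(predicted: str, gold: str) -> float:
    """Exact-match grading, as in benchmarks with verifiable answers."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0

def rubric_score(answer: str, rubric: list[tuple[str, float]]) -> float:
    """Rubric grading: award each criterion's weight if the answer satisfies it.
    Here a 'criterion' is just a keyword check, standing in for a human
    or LLM judge deciding whether the criterion is met."""
    total = sum(weight for _, weight in rubric)
    earned = sum(weight for keyword, weight in rubric
                 if keyword.lower() in answer.lower())
    return earned / total

# Hypothetical rubric for grading a written experimental summary.
rubric = [("control group", 0.5), ("sample size", 0.3), ("p-value", 0.2)]
print(rubric_score("We used a control group and reported a p-value.", rubric))
```

The point of the sketch is that rubric grading gives partial credit over free-form output, which keeps a benchmark informative longer than exact matching, but both approaches still saturate once models reliably hit every criterion.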


We see this extensively in science. Two years ago, knowledge- and reasoning-based questions easily stumped GPT-3.5 or GPT-4. Today, in order to make hard benchmarks, we have to create long, multi-step problems that combine multimodal inputs and outputs. To be clear, we are a long way away from superintelligence in science or programming. Vibecoding is good for 0 to 1 demos, but falls on its face when it comes to building a large, production-ready codebase. The models still have terrible taste in science, and cannot distinguish between results that are likely robust and results that are likely noise. But it seems like we may only be a few years away from a future in which even objectively verifiable evaluations or rubric-based evals of the kind we are using today are no longer possible, because the models will simply be too good.


What happens then? The next stage in the evolution of language model evaluation will be real-world evals, in which you measure the performance of language models on the actual tasks you want them to perform. In science, for example, you might measure their ability to actually write a paper that can be accepted at Nature or Science from scratch, including doing the experiments, writing the code, and so on. In software engineering, you might measure their ability to create a video game from scratch. Unlike the kind of evals we have today, these evals will be extremely time-consuming to run (e.g., the model might need to run some experiments) and extremely time-consuming to evaluate (e.g., you might need to actually play the game), which will significantly increase the cost of evaluation, both in money and in time.


In some cases, there will be shortcuts: we can get the models to reproduce existing scientific papers, to recreate existing video games from scratch, or to predict the outcomes of clinical trials that have been run but that have not been published yet. But these shortcuts will only go so far, since they will restrict evaluation to the limited subset of tasks that humans have already performed (when in reality you want evaluation to be performed over the entire distribution), they will place an upper bound on the performance we can measure (i.e., we won't be able to measure performance above what humans can do), and the evaluations will be spoiled once the product being reproduced is in the public domain and can in principle be accessed by the model or agent being evaluated. In real-world evals, there is no substitute for actually letting the model do the thing itself, and measuring the result; and in many cases (e.g. video games, clinical trials), measuring the result will take an extraordinary amount of human labor.


(Once we saturate real-world evals, we are effectively done. If we can saturate problems like "build me a spacecraft that can fly me at relativistic speeds to alpha centauri," then I think we can say confidently that we have solved AI.)


The upshot of all this is that, as we get closer to ASI, we will increasingly encounter more sophisticated variants of the tic tac toe problem, and we will need to dedicate increasing amounts of human effort to measuring model performance on real-world tasks. In five years, testing whether your model is getting better at making video games may require generating 1,000 video games at each checkpoint and having humans play them all. Already, with Kosmos, we fundamentally have to evaluate the system by having humans read and understand the outputs, which is extremely laborious. The resources we dedicate to model evaluation may soon be on par with the resources we dedicate to training the models in the first place; and the market for human evaluators may grow, even as AI increases productivity and decreases the need for human labor in other areas.



©2024 by Sam Rodriques
