Is progress in Large Language Models slowing down?
- Ville Karlsson

- Mar 6
- 2 min read
Soon it will be six years since the release of the revolutionary AI model GPT-3. In the period that followed, each month seemed to bring a new breakthrough, and a new headline forecasting the end of work as we know it. Recently, though, at least according to a loud minority, the deafening roar of progress has finally started to quiet down. But what does the data say? And how can we even quantify the quality of these models?
Measuring the capability of models is notoriously slippery. In the early days, the field relied on static benchmarks: tasks like coding challenges, math problems, or multiple-choice exams. But eventually the answers to those benchmarks spread across the internet, where they were absorbed and memorized by the AI models, allowing them to essentially cheat by knowing the contents of the “exam” before taking it.
In recent years, an alternative way to test models emerged: the Chatbot Arena. Instead of static tests, the arena uses a crowdsourced gladiatorial approach: two models generate an answer to the same human prompt, and the user votes on which one was better. In this way, the model that appears better to humans gets rated higher, giving a reliable measure of perceived ability.
To turn these many votes into a concrete metric, the Chatbot Arena relies on the Elo rating system, the same mathematical framework used in competitive chess. In an Elo system, a model's rating is constantly evolving, adjusting after every bout depending on who won and how large the rating gap was going into the match. To put the numbers into perspective, an Elo advantage of 100 points means the higher-rated model is expected to win about 64% of the time.
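The Elo mechanics described above can be sketched in a few lines of Python. This is an illustrative toy, not the Arena's actual implementation: the K-factor of 32 and the 400-point scale are the standard chess conventions, and real leaderboards use more sophisticated variants fitted over the whole vote history.

```python
def expected_score(rating_a, rating_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a, rating_b, a_won, k=32):
    """Adjust both ratings after one bout; k controls the step size."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# A 100-point gap implies roughly a 64% win chance for the stronger model:
print(round(expected_score(1300, 1200), 2))  # 0.64
```

Note that an upset win against a much higher-rated opponent moves both ratings by a large step, while an expected win barely moves them, and the total number of rating points in the system is conserved.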
So, looking at the data from the Chatbot Arena, do people really think progress is slowing down?
If one looks at the raw numbers from the last two years, the answer is a nuanced yes. The rate of improvement is highly volatile and depends on periods of breakthrough. There was a great leap between late 2024 and early 2025, driven by the widespread adoption of “reasoning” models such as OpenAI’s o1 series, Grok 3, and Gemini 2.5. During that period, Elo scores shot up by nearly 100 points, whereas the current, more stable period is characterized by an increase of around 12 points per quarter. Recent history thus suggests that fast progress is intrinsically tied to breakthrough ideas that are unreliable and very difficult to predict.
Ultimately, labeling this recent deceleration as the end of progress is premature. History in this field has taught us that plateaus can appear and then vanish; the GPT-4 era looked like a wall for more than a year until the reasoning models vaulted over it. We will need at least another year of data to say whether we are approaching long-lasting stagnation in the performance of Large Language Models.




