Study accuses LM Arena of helping top AI labs game its benchmark
Key Findings from the Study
The authors of the study report that LM Arena enabled select AI labs to run multiple private tests using several model variations. They claim that while most companies were restricted to public scores, a handful were allowed to test their models more extensively. According to Sara Hooker, VP of AI Research at Cohere and a co-author of the paper, “Only a handful of companies were told that this private testing was available, and the amount of testing they received far outstripped that of others. This is gamification.”
The study highlights that Chatbot Arena, which started as an academic project at UC Berkeley in 2023, works by pitting two AI models against each other in a head-to-head comparison, with users voting on which answer they prefer. This user-driven approach results in scores that determine a model’s position on the leaderboard.
How the Benchmarking Process Works
Chatbot Arena has become a key benchmark in the AI community because it lets users vote on which of two models performs better in direct comparisons, with votes accumulating over time into each model’s final score (a simplified sketch of this kind of pairwise rating system follows the list below). Although many companies participate, the study suggests that the system may not be entirely fair if some players are given additional private testing opportunities.
- Special access: Certain companies were able to conduct private tests on multiple model variants.
- Score skewing: Only high-performing scores were made public, potentially favoring companies with extra testing opportunities.
- Leaderboard influence: Enhanced exposure through additional battles could significantly boost a model’s ranking.
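The article does not detail LM Arena’s exact scoring math, but leaderboards built from head-to-head votes are commonly computed with an Elo- or Bradley-Terry-style rating system. The sketch below is purely illustrative under that assumption; the data format, starting rating, and K-factor are invented for the example and are not LM Arena’s actual implementation.

```python
# Illustrative Elo-style rating update from head-to-head votes.
# This is a simplified sketch, not LM Arena's actual scoring code;
# the K-factor, starting rating, and vote format are assumptions.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both models' ratings toward the observed vote outcome."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - exp_win)
    ratings[loser] -= k * (1.0 - exp_win)

# Each vote records (winning_model, losing_model), as chosen by a user.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
for winner, loser in votes:
    update_ratings(ratings, winner, loser)

leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
print(leaderboard)
```

Each vote nudges the winner up and the loser down by an amount that depends on how surprising the outcome was, which is why accumulating many favorable battles can meaningfully move a model up the leaderboard.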
Industry Response and LM Arena’s Comments
In response to the study, LM Arena co-founder and UC Berkeley Professor Ion Stoica dismissed the findings as containing “inaccuracies” and “questionable analysis.” The organization maintained that its evaluation process is fair and community-driven. In a statement shared on X, LM Arena reiterated its commitment to transparency, noting that it invites all model providers to participate and that differing numbers of test submissions do not necessarily equate to unfair treatment.
LM Arena further explained that detailed information about pre-release model testing has been available since March 2024 through its policy update and subsequent posts.
Evidence of Favored Access
The study presents evidence suggesting that certain AI companies, including Meta, OpenAI, and Google, benefited from enhanced sampling. For example, the researchers report that Meta privately tested 27 different model variants between January and March, ahead of the launch of its Llama 4 model, yet only a single high-ranking score was made public, which the authors argue inflated the model’s apparent performance.
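The selective-reporting effect the authors describe is straightforward to illustrate with a toy simulation. The sketch below is hypothetical and is not the study’s methodology: it assumes each privately tested variant’s measured score is a noisy draw around a single underlying skill level, and shows how publicizing only the best of 27 draws overstates that skill.

```python
# Hypothetical illustration of selective reporting: if a lab privately
# benchmarks many variants and publicizes only the best score, the
# published number overstates the typical result. The score model
# (normal noise around a fixed "true" level) is an assumption made
# purely for illustration, not data from the study.
import random

random.seed(0)

TRUE_SKILL = 1200.0    # hypothetical underlying rating
NOISE_STD = 25.0       # run-to-run variation in measured scores
NUM_VARIANTS = 27      # number of privately tested variants (per the study's Meta example)
NUM_TRIALS = 10_000

best_scores = []
for _ in range(NUM_TRIALS):
    scores = [random.gauss(TRUE_SKILL, NOISE_STD) for _ in range(NUM_VARIANTS)]
    best_scores.append(max(scores))

avg_best = sum(best_scores) / NUM_TRIALS
print(f"True skill:             {TRUE_SKILL:.1f}")
print(f"Average published best: {avg_best:.1f}")
print(f"Apparent inflation:     {avg_best - TRUE_SKILL:+.1f} points")
```

Even when no variant is genuinely better than another, reporting only the maximum of many noisy measurements produces a published score well above the underlying level, which illustrates why the authors argue that disclosing only the best private result inflates apparent performance.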
Additional analysis suggested that increased testing could boost a model’s performance on another LM Arena benchmark, Arena Hard, by as much as 112%. However, LM Arena clarified on X that performance on Arena Hard is not directly linked with the public Chatbot Arena scores.
Study Limitations and Next Steps
The researchers acknowledged that their approach had limitations. They categorized AI models based on self-identification, which is not a foolproof method. Even with these methodological caveats, LM Arena did not dispute the study’s preliminary findings when approached by the researchers.
None of the companies mentioned in the study (Meta, Google, OpenAI, and Amazon) provided immediate comment when reached for further clarification.
Calls for Greater Transparency
The study concludes that LM Arena should consider changes to ensure fairer competition on Chatbot Arena. The recommendations include setting a transparent cap on the number of private tests any AI lab can conduct and publicly disclosing all scores from these tests. While LM Arena has rejected the need for such measures regarding pre-release models, it has shown openness to refining its sampling algorithm to offer more equal exposure for all competing models.
In an earlier episode, Meta optimized one version of its Llama 4 model for “conversationality” to secure a top leaderboard spot; the standard, unoptimized version later ranked considerably lower, sparking further debate about benchmark transparency and fairness.
Looking Ahead
The findings add to growing scrutiny over the reliability of private AI benchmarks. With LM Arena now preparing to launch as a full-fledged company, the pressure is on to prove that its evaluation processes remain unbiased and equitable, even when major industry players are involved.
For more technical details and to review the original research, you can read the full paper on arXiv.
Update (4/30/25, 9:35pm PT): A previous version of this article featured comments from a Google DeepMind engineer regarding part of the study’s methodology. While the engineer confirmed that Google submitted 10 models for pre-release testing between January and March, they clarified that Google’s open-source team provided only one of them.