LMSYS Chatbot Arena: Your Guide to Large Language Model Comparisons

Introduction

Each week brings the debut of newer, more advanced Large Language Models (LLMs), each vying to outshine its predecessors. How can one keep up with these rapid-fire developments? The answer lies in the LMSYS Chatbot Arena, an innovative platform built by the Large Model Systems Organization (LMSYS Org), a collective of students and faculty from UC Berkeley, UCSD, and CMU. The platform simplifies comparing and evaluating LLMs by letting users test and rate them, making it a hub for anyone interested in the latest LLM releases and their relative performance.

LMSYS Leaderboard

The LMSYS leaderboard ranks LLMs using a Bradley-Terry model, with ratings presented on an Elo scale and derived entirely from human pairwise comparisons. As of April 26, 2024, it features 91 different models and has amassed over 800,000 votes. Models can also be ranked within categories such as coding and long user queries, and the rankings are updated continuously.
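To make the rating scale concrete, here is a minimal sketch (not LMSYS's actual code) of the Bradley-Terry win probability expressed on the conventional Elo scale; the ratings in the example are invented for illustration:

```python
# Illustrative sketch: Bradley-Terry win probability on an Elo scale.
# The ratings below are made up, not real leaderboard values.

def win_probability(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under Bradley-Terry,
    parameterized on the conventional Elo scale (base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A 100-point rating gap implies roughly a 64% expected win rate.
print(win_probability(1250, 1150))  # ~0.64
```

On this scale, a 100-point gap corresponds to about a 64% win rate, which is why even small rating differences near the top of the leaderboard are meaningful.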

Top 10 LLMs

The top-ranked models by Arena Elo rating include GPT-4-Turbo by OpenAI, GPT-4-1106-preview by OpenAI, Claude 3 Opus by Anthropic, Gemini 1.5 Pro API-0409-Preview by Google, and others. OpenAI currently appears to be leading the race for the best LLMs. The term “preview” in a model’s name indicates a version made available for testing before its official release, much like beta software.

Difference Between Open-Source and Closed-Source LLMs

Llama 3 is often hailed as the best open-source LLM, yet GPT-4-Turbo tops the overall rankings because the leaderboard mixes open-source and closed-source models. The leaderboard’s last column lists each model’s license, which tells you whether it is open source or closed source.

Open-Source LLMs

The code and weights of open-source LLMs are publicly accessible, fostering a collaborative development environment. Some models, like Mixtral-8x22b-Instruct and Zephyr-ORPO, carry permissive licenses that allow unrestricted use. Others, such as Command R+ and Llama 3, come with license restrictions that limit commercial use or modification.

Closed-Source LLMs

Closed-source LLMs are not publicly available and require permission or licensing to use. They are typically developed by commercial entities, such as OpenAI’s GPT-4 series, Google’s Gemini series, and Anthropic’s Claude series. In short, open-source LLMs offer transparency and collaboration, while closed-source LLMs prioritize control and often provide a more polished user experience.

How Does the LMSYS Arena Work?

The LMSYS platform evaluates LLMs by collecting user dialogue data. Users compare two anonymous LLMs side by side on a task of their choosing, such as writing a poem or answering a question, and vote for the better response. The platform then feeds these votes into the Bradley-Terry model to update the models’ rankings.
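To ground this, below is a toy sketch (assuming a hypothetical list of winner/loser vote pairs, not LMSYS’s real data or pipeline) that fits Bradley-Terry strengths with a simple iterative maximum-likelihood update known as Zermelo’s algorithm, then converts them to an Elo-like scale. The real pipeline is more involved, handling ties, per-category rankings, and confidence intervals:

```python
# Toy sketch: fit Bradley-Terry strengths from pairwise votes.
import math
from collections import defaultdict

votes = [  # (winner, loser) pairs from imagined user votes
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
    ("model_c", "model_b"),
]

models = sorted({m for pair in votes for m in pair})
wins = defaultdict(int)    # total wins per model
games = defaultdict(int)   # total games per unordered pair
for winner, loser in votes:
    wins[winner] += 1
    games[frozenset((winner, loser))] += 1

# Iterative maximum-likelihood update (Zermelo's algorithm):
# p_i <- wins_i / sum_over_j( games_ij / (p_i + p_j) )
strength = {m: 1.0 for m in models}
for _ in range(100):
    new = {}
    for i in models:
        denom = sum(
            games[frozenset((i, j))] / (strength[i] + strength[j])
            for j in models if j != i
        )
        new[i] = wins[i] / denom if denom > 0 else strength[i]
    total = sum(new.values())  # pin the scale by normalizing
    strength = {m: s * len(models) / total for m, s in new.items()}

# Display the fitted strengths on an Elo-like scale.
for m in models:
    print(m, round(400 * math.log10(strength[m]) + 1000))
```

Because the fit uses the entire vote history at once, the resulting ratings do not depend on the order in which votes arrived, unlike a classic online Elo update.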

LMSYS Leaderboard Evaluation System

The LMSYS leaderboard rates LLMs using the Elo rating scale together with the Bradley-Terry model. The Elo system, familiar from chess, scores players based on their wins and losses against other players. The Bradley-Terry model goes further by fitting a strength score for every model from the full history of comparisons at once, which yields more stable estimates than updating ratings game by game. In the LMSYS Chatbot Arena, the LLMs are the players: their scores rise and fall with wins and losses, reflecting their current relative strengths.
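For intuition, here is the classic online Elo update applied to a single head-to-head vote. The K-factor and starting ratings are illustrative choices, not constants taken from LMSYS:

```python
# Classic online Elo update for one head-to-head vote.
# K controls how far a single result moves the ratings (illustrative value).
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two evenly matched models: the winner gains exactly what the loser gives up.
print(elo_update(1200.0, 1200.0, a_won=True))  # (1216.0, 1184.0)
```

An upset win against a much stronger opponent moves the ratings more than an expected win does, which is what lets the rankings track each model’s current strength over time.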

Conclusion

This article aimed to help you understand the LMSYS leaderboard and keep up with LLM developments. With its user-driven ranking system and detailed scoring methods, the LMSYS Chatbot Arena is an excellent place to assess LLM performance, and a better understanding of these models leads to more effective real-world use. If you know of other resources for staying current in Generative AI, share them in the comments.