LLM Leaderboards

There are many LLM leaderboards available. As with search rankings and Search Engine Optimization (SEO), a leaderboard score is a measurement that gets “gamed” by model providers. The most common issue is that a model “cheats” by being trained on the leaderboard's test questions. In real applications its results are then worse than the leaderboard suggests, because it has not seen that material before.

<aside> 💡

As leaderboards become popular, AI companies notice. One thing that can happen is overfitting, where a new model is trained on the leaderboard questions. This is the equivalent of a student who memorises past exam papers: their exam results may improve, but in the real world, outside of exam questions, they do badly. In the same way, leaderboard results can be gamed to look better than they really are. So your personal experience really matters!

</aside>
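
The gap between leaderboard scores and real-world results can be shown with a deliberately extreme toy example. The sketch below is purely illustrative: the "model" is just a lookup table, and the questions, answers, and names (`LEAKED_BENCHMARK`, `UNSEEN_QUESTIONS`, `memorising_model`) are made up for this example rather than taken from any real leaderboard.

```python
# Toy illustration only: a "model" that has memorised leaked benchmark answers.
# All questions and names here are hypothetical.
LEAKED_BENCHMARK = {
    "What is 2 + 2?": "4",
    "Capital of France?": "Paris",
    "Opposite of hot?": "cold",
}

# Fresh questions the "model" never saw (a stand-in for real-world use).
UNSEEN_QUESTIONS = {
    "What is 3 + 5?": "8",
    "Capital of Italy?": "Rome",
    "Opposite of up?": "down",
}


def memorising_model(question: str) -> str:
    """Return the memorised answer, or a fixed fallback for anything new."""
    return LEAKED_BENCHMARK.get(question, "I don't know")


def accuracy(model, dataset: dict[str, str]) -> float:
    """Fraction of questions answered exactly right."""
    correct = sum(model(q) == a for q, a in dataset.items())
    return correct / len(dataset)


benchmark_score = accuracy(memorising_model, LEAKED_BENCHMARK)
real_world_score = accuracy(memorising_model, UNSEEN_QUESTIONS)

print(f"Leaderboard score: {benchmark_score:.0%}")
print(f"Unseen questions:  {real_world_score:.0%}")
```

Real models are nowhere near this extreme, but the direction of the effect is the same: a score measured on material the model has already seen says little about material it has not.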

Our favourite leaderboards are:

1. Chatbot Arena

lmarena.ai

(Formerly known as the LMSYS leaderboard)

https://lmarena.ai/

<aside> 🥷

Notes

Has a free chatbot interface that “battles” two LLMs against each other on real user chats.
Users vote for the response they prefer, without knowing which LLM produced it.
Uses statistical methods to turn these votes into a reliable ranking.
Part of a UC Berkeley project (prestigious reputation for AI research). Published paper here: https://arxiv.org/pdf/2403.04132 </aside>
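
To give a feel for how blind head-to-head votes can become a ranking, here is a minimal sketch using the classic Elo update from chess. It is an assumption chosen for illustration, not Chatbot Arena's actual pipeline; the linked paper describes the statistical approach the project really uses. The model names, the vote list, the starting rating of 1000, and `k=32` are all hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Shift rating points from loser to winner; upsets move more points."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical blind votes, each recorded as (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]

# Every model starts from the same rating.
ratings = {name: 1000.0 for name in ("model_a", "model_b", "model_c")}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Highest rating first.
ranking = sorted(ratings, key=ratings.get, reverse=True)
for name in ranking:
    print(f"{name}: {ratings[name]:.1f}")
```

Because each vote only compares two anonymous answers, no single user needs to test every model; the ranking emerges from many small comparisons.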

<aside> 💡

Feeling confused already? The good news is that with Expanse you don’t have to commit to a single model or provider. You can switch as often as you like, even in the middle of a conversation!

</aside>

2. Aider leaderboard (for coding)

https://aider.chat/docs/leaderboards/

<aside> 🥷

Notes

This leaderboard is produced by Paul Gauthier, the creator of Aider. It is widely respected on technical subreddits and Twitter/X.
Note that the Aider leaderboard does not match the Chatbot Arena leaderboard, as discussed below. </aside>

3. Personal Experience

Everyone uses AI differently and for different use cases, so the best way to find the right LLM for you is to try several models and compare the results.

For example, Grok’s DeepSearch is good for real-time information because it has access to X posts (formerly known as tweets). There are many other search products, but it was only by trying this one myself that I discovered how useful it is for real-time information. At the time, no leaderboard I had seen reflected this, so I only found it through personal testing.

<aside> 💡

Your personal experience is one of the most important LLM leaderboards.

</aside>
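
One way to make personal testing a little more systematic is to keep your own small tally, borrowing the blind-vote idea from Chatbot Arena. The sketch below is only an outline under stated assumptions: the prompts, the canned answers, and the names `model_x` and `model_y` are hypothetical, and `my_judgement` is a placeholder (it simply prefers the shorter answer) for you reading both answers and picking one.

```python
import random
from collections import Counter

# Hypothetical prompts with canned answers from two hypothetical models.
# In practice you would paste in real answers from the models you are trying.
my_prompts = {
    "Summarise this email": {
        "model_x": "Short summary.",
        "model_y": "A much longer, rambling summary.",
    },
    "Name a prime number": {
        "model_x": "7",
        "model_y": "Seven is prime, as are many others.",
    },
    "Translate 'hello' to French": {
        "model_x": "It could be several things.",
        "model_y": "Bonjour",
    },
}


def blind_pair(answers: dict[str, str], rng: random.Random) -> list[tuple[str, str]]:
    """Shuffle (model, answer) pairs so display order does not reveal the model."""
    pairs = list(answers.items())
    rng.shuffle(pairs)
    return pairs


def my_judgement(shown_answers: list[str]) -> int:
    """Placeholder for a human choice: returns the index of the shorter answer."""
    return min(range(len(shown_answers)), key=lambda i: len(shown_answers[i]))


rng = random.Random(0)
tally = Counter()
for prompt, answers in my_prompts.items():
    pairs = blind_pair(answers, rng)
    choice = my_judgement([answer for _, answer in pairs])
    tally[pairs[choice][0]] += 1  # credit the model behind the chosen answer

print(tally.most_common())
```

Hiding the model names while you choose helps keep brand reputation out of the decision, so the tally reflects your own use cases.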

Specialized use cases