There are many leaderboards available. Like Search Engine Optimization (SEO) rankings, leaderboard scores are a measurement that many model providers “game”. The most common problem is that a model is trained on the leaderboard’s test questions. Its leaderboard score then looks better than its results in real applications, where it faces material it has not seen before.
<aside> 💡
As leaderboards become popular, AI companies notice. One thing that can happen is overfitting, where a new model is trained on the leaderboard questions. This is the equivalent of a student memorizing past exam papers: the exam results may improve, but in the real world, outside of those exam questions, the student does badly. In the same way, leaderboard results can be gamed to look better than they really are. So your personal experience really matters!
</aside>
Chatbot Arena (formerly known as the LMSYS leaderboard)
<aside> 🥷
Locky’s Notes
Has a free chatbot interface that “battles” two LLMs against each other using real user chats.
Users vote for the response they prefer, without knowing which LLM produced it.
Uses some sophisticated statistical methods to make the resulting ranking accurate.
Part of a UC Berkeley project (UC Berkeley has a prestigious reputation for AI research). Published paper here: https://arxiv.org/pdf/2403.04132 </aside>
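To make the blind “battle” voting described in the notes above more concrete, here is a minimal sketch of how pairwise votes can be turned into a ranking. It uses a plain Elo-style update purely as an illustration. It is not the leaderboard's actual implementation, which the linked paper describes in more detail. The model names, the `k` factor, and the starting rating of 1000 are all made-up assumptions.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(votes, k: float = 32.0, start: float = 1000.0):
    """Apply one Elo update per vote.

    Each vote is (model_a, model_b, winner), where winner is one of the
    two model names or the string "tie". All names here are hypothetical.
    """
    ratings = defaultdict(lambda: start)
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        ea = expected_score(ra, rb)
        # Actual score for A: 1 for a win, 0 for a loss, 0.5 for a tie.
        if winner == model_a:
            sa = 1.0
        elif winner == model_b:
            sa = 0.0
        else:
            sa = 0.5
        # Whatever A gains, B loses, so the total rating stays constant.
        ratings[model_a] = ra + k * (sa - ea)
        ratings[model_b] = rb - k * (sa - ea)
    return dict(ratings)


# Hypothetical blind votes: the user never saw the model names while voting.
votes = [
    ("model-x", "model-y", "model-x"),
    ("model-y", "model-z", "model-y"),
    ("model-x", "model-z", "tie"),
]
ratings = update_ratings(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rating:.0f}")
```

With only three votes the ratings barely move. A real leaderboard needs a very large number of votes before the ordering becomes trustworthy.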
<aside> 💡
Feeling confused already? The good news is that with Expanse you don’t have to commit to a single model or provider. You can switch as often as you like, even in the middle of a conversation!
</aside>
https://aider.chat/docs/leaderboards/
<aside> 🥷
Locky’s Notes
This is produced by Paul Gauthier, the creator of Aider. This leaderboard is widely respected on technical subreddits and Twitter/X.
Note that the Aider leaderboard does not match the Chatbot Arena leaderboard, as discussed below. </aside>
Everyone uses AI differently and for different use cases, so the best way to find the right LLM for you is to try several models and compare the results.
<aside> 💡
Your personal experience is one of the most important LLM leaderboards.
</aside>
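If you want to keep track of your own trials, one simple option is a personal tally. The sketch below assumes you note down, for each task, which models you tried and which answer you preferred. The model names and the trial data are hypothetical placeholders, and a win rate is only one possible way to score this.

```python
from collections import Counter


def personal_leaderboard(trials):
    """Rank models by how often you preferred them.

    Each trial is (models_tried, preferred_model). Returns a list of
    (model, win_rate) pairs, best first.
    """
    wins = Counter()
    appearances = Counter()
    for models_tried, preferred in trials:
        for model in models_tried:
            appearances[model] += 1
        wins[preferred] += 1
    board = [(model, wins[model] / appearances[model]) for model in appearances]
    return sorted(board, key=lambda entry: entry[1], reverse=True)


# Hypothetical notes from three of your own tasks.
trials = [
    (("model-x", "model-y"), "model-x"),
    (("model-x", "model-z"), "model-x"),
    (("model-y", "model-z"), "model-z"),
]
board = personal_leaderboard(trials)
for model, win_rate in board:
    print(f"{model}: preferred in {win_rate:.0%} of trials")
```

The tally only reflects the kinds of tasks you actually tried, which is the point: it measures the models on your own use case, not on a public test set.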