LLM Chess Leaderboard
Simulating chess games between a Random Player and an LLM. Evaluating chat models' (1) chess proficiency and (2) instruction-following abilities.

 __       __                    ____     __                               
/\ \     /\ \       /'\_/`\    /\  _``. /\ \                              
\ \ \    \ \ \     /\      \   \ \ \/\_\\ \ \___      __    ____    ____  
 \ \ \  __\ \ \  __\ \ \__\ \   \ \ \/_/_\ \  _ `\  /'__`\ /',__\  /',__\ 
  \ \ \L\ \\ \ \L\ \\ \ \_/\ \   \ \ \L\ \\ \ \ \ \/\  __//\__, `\/\__, `\
   \ \____/ \ \____/ \ \_\\ \_\   \ \____/ \ \_\ \_\ \____\/\____/\/\____/
    \/___/   \/___/   \/_/ \/_/    \/___/   \/_/\/_/\/____/\/___/  \/___/ 

        
         __         __         __    __                        
        /\ \       /\ \       /\ "-./  \                       
        \ \ \____  \ \ \____  \ \ \-./\ \                      
         \ \_____\  \ \_____\  \ \_\ \ \_\                     
          \/_____/   \/_____/   \/_/  \/_/                                                                                
 ______     __  __     ______     ______     ______    
/\  ___\   /\ \_\ \   /\  ___\   /\  ___\   /\  ___\   
\ \ \____  \ \  __ \  \ \  __\   \ \___  \  \ \___  \  
 \ \_____\  \ \_\ \_\  \ \_____\  \/\_____\  \/\_____\ 
  \/_____/   \/_/\/_/   \/_____/   \/_____/   \/_____/ 
        
GPT-4o Mini (Black)
♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖
Random Player (White)

GAME OVER

- Outcome: Draw
- Max moves reached: 200
- Material White: 16
- Material Black: 18

Can a Large Language Model play chess? Prompt it to move based on the board state, give it the legal moves as a hint, and it can make moves (though some models struggle with instruction following) and even explain its tactics or strategy.
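
Here is a minimal sketch of that setup, assuming the python-chess and openai packages; the prompt wording, model choice, and the llm_move helper are illustrative assumptions, not the project's exact protocol:

```python
import chess
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_move(board: chess.Board, model: str = "gpt-4o-mini") -> chess.Move | None:
    """Ask a chat model for one move, given the board state and the legal moves."""
    legal = [m.uci() for m in board.legal_moves]
    prompt = (
        "You are playing chess as Black.\n"
        f"Board (FEN): {board.fen()}\n"
        f"Legal moves: {', '.join(legal)}\n"
        "Reply with exactly one move in UCI notation and nothing else."
    )
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content.strip()
    # An illegal or malformed reply is what the leaderboard counts as a mistake.
    return chess.Move.from_uci(reply) if reply in legal else None
```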

But can LLMs make meaningful moves and win? Let's test them against a random player (a bot that picks a legal move at random). These foundation models, with their vast knowledge and reasoning abilities, should easily defeat a chaos monkey, right?

Let's find out ツ
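
Here is a minimal sketch of one such simulation (random mover as White, 200-move cap), reusing the hypothetical llm_move helper above; the real harness may differ in details such as retry handling:

```python
import random
import chess

def play_game(max_moves: int = 200) -> str:
    board = chess.Board()
    for _ in range(max_moves):
        if board.is_game_over():
            break
        if board.turn == chess.WHITE:
            # The chaos monkey: pick any legal move uniformly at random.
            board.push(random.choice(list(board.legal_moves)))
        else:
            move = llm_move(board)  # hypothetical helper from the sketch above
            if move is None:
                return "interrupted"  # an LLM mistake ended the game loop
            board.push(move)
    return board.result(claim_draw=True)  # "1-0", "0-1", "1/2-1/2", or "*"
```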

[Leaderboard table; columns: #, Player, Wins/Diff, Draws, Mistakes, Tokens]

METRICS:

- Player: Model playing as black against a Random Player.
- Wins-Losses: Difference between wins and losses as a percentage of total games. This metric highlights an LLM's chess proficiency by rewarding winning more games while losing fewer.
- Wins (old leaderboard): How often the player scored a win (via checkmate or the opponent failing to make a move). Reflects an LLM's chess proficiency through its success rate at achieving victories.
- Avg Moves: Average number of moves per game. For stronger, winning models, lower is better, since a shorter game means a quicker win. For weaker, losing models, higher is better, as it means the model can stay in the game longer without interrupting the game loop due to a mistake.
- Draws (old leaderboard): Percentage of games without a winner (e.g., reaching the maximum of 200 moves, or stalemate). Reflects a weak LLM's chess proficiency when it can't win.
- Mistakes: Number of erroneous LLM replies per 1000 moves, i.e., how often the LLM failed to follow the instructions and make a move (e.g., due to hallucinations, picking an illegal move, or not conforming to the communication protocol). Shows the model's instruction-following capabilities and hallucination resistance; see the aggregation sketch after this list.
- Tokens: Number of tokens generated per move. Demonstrates the model's verbosity.
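
As a rough illustration, here is how those metrics could be aggregated from per-game records; the data shape is an assumption, not the project's actual log format:

```python
def leaderboard_row(games: list[dict]) -> dict:
    """Aggregate per-game records shaped like
    {"result": "win" | "loss" | "draw", "moves": 180, "mistakes": 2, "tokens": 5400}
    into the leaderboard metrics described above."""
    n = len(games)
    total_moves = sum(g["moves"] for g in games)
    wins = sum(g["result"] == "win" for g in games)
    losses = sum(g["result"] == "loss" for g in games)
    draws = sum(g["result"] == "draw" for g in games)
    return {
        "wins_minus_losses_pct": 100 * (wins - losses) / n,
        "draws_pct": 100 * draws / n,
        "avg_moves": total_moves / n,
        "mistakes_per_1000_moves": 1000 * sum(g["mistakes"] for g in games) / total_moves,
        "tokens_per_move": sum(g["tokens"] for g in games) / total_moves,
    }
```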

NOTES:

- LLMs played as black against a Random Player (as white).
- 30+ game simulations for Random Player vs. LLM.
- Bottom rows in green demonstrate how a Chess Engine (Stockfish v17) fares against a Random Player.
- 1000 simulations for Random Player vs. Chess Engine and Random vs. Random.
- You read that right: LLMs scored 0 wins for a long time. That's no longer the case, with o1-mini being the first LLM to score wins.
- Since wins were so rare, Draws were used instead of Wins to evaluate LLMs' chess proficiency.
- The default sorting is by Wins-Losses DESC, then Draws DESC, then Mistakes ASC (new leaderboard); the old leaderboard sorts by Wins DESC, then Draws DESC, then Mistakes ASC. See the sorting sketch after these notes.

- Strong models (those that win) are judged on chess proficiency by % Wins; weak ones by % Draws.
- The Mistakes metric evaluates an LLM's instruction-following capabilities and resistance to hallucinations (making up illegal moves even though a list of legal moves is provided in the prompt).
- Sort by the Mistakes column to get a ranking by instruction-following ability (models with the fewest mistakes being better).
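
A tiny sketch of that default ordering; the field names and numbers are made up for illustration:

```python
rows = [
    {"player": "model-a", "win_loss_diff": 10.0, "draws": 40.0, "mistakes": 1.2},
    {"player": "model-b", "win_loss_diff": 10.0, "draws": 55.0, "mistakes": 3.4},
]
# Wins-Losses DESC, then Draws DESC, then Mistakes ASC.
rows.sort(key=lambda r: (-r["win_loss_diff"], -r["draws"], r["mistakes"]))
```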

Project's GitHub