LLM Chess Leaderboard
Simulating chess games between a Random Player and an LLM. Evaluating Chat Models' (1) chess proficiency and (2) instruction-following abilities.

 __       __                    ____     __                               
/\ \     /\ \       /'\_/`\    /\  _``. /\ \                              
\ \ \    \ \ \     /\      \   \ \ \/\_\\ \ \___      __    ____    ____  
 \ \ \  __\ \ \  __\ \ \__\ \   \ \ \/_/_\ \  _ `\  /'__`\ /',__\  /',__\ 
  \ \ \L\ \\ \ \L\ \\ \ \_/\ \   \ \ \L\ \\ \ \ \ \/\  __//\__, `\/\__, `\
   \ \____/ \ \____/ \ \_\\ \_\   \ \____/ \ \_\ \_\ \____\/\____/\/\____/
    \/___/   \/___/   \/_/ \/_/    \/___/   \/_/\/_/\/____/\/___/  \/___/ 

        
         __         __         __    __                        
        /\ \       /\ \       /\ "-./  \                       
        \ \ \____  \ \ \____  \ \ \-./\ \                      
         \ \_____\  \ \_____\  \ \_\ \ \_\                     
          \/_____/   \/_____/   \/_/  \/_/                                                                                
 ______     __  __     ______     ______     ______    
/\  ___\   /\ \_\ \   /\  ___\   /\  ___\   /\  ___\   
\ \ \____  \ \  __ \  \ \  __\   \ \___  \  \ \___  \  
 \ \_____\  \ \_\ \_\  \ \_____\  \/\_____\  \/\_____\ 
  \/_____/   \/_/\/_/   \/_____/   \/_____/   \/_____/ 
        
Sample game: Random Player (White) vs. GPT-4o Mini (Black)

♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖

GAME OVER

- Outcome: Draw
- Max moves reached: 200
- Material White: 16
- Material Black: 18

Can a Large Language Model play chess? Prompt it with the board state, hint at the legal moves, and it can make moves (though some struggle with instruction following) and even explain its tactics or strategy.
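Here is a minimal sketch of that prompting idea, assuming the python-chess library for board handling; `query_llm` is a hypothetical stand-in for whatever chat-completion client is used, and the prompt wording is illustrative rather than the project's exact protocol.

```python
import chess

def build_prompt(board: chess.Board) -> str:
    # Show the model the current position plus the full list of legal moves.
    legal_moves = ", ".join(move.uci() for move in board.legal_moves)
    return (
        "You are playing chess as Black.\n"
        f"Position (FEN): {board.fen()}\n"
        f"Legal moves: {legal_moves}\n"
        "Reply with exactly one move in UCI notation."
    )

def query_llm(prompt: str) -> str:
    # Hypothetical placeholder: call your chat model here and return its reply text.
    raise NotImplementedError

def llm_move(board: chess.Board) -> chess.Move:
    reply = query_llm(build_prompt(board)).strip()
    move = chess.Move.from_uci(reply)   # raises on a malformed reply
    if move not in board.legal_moves:   # a well-formed but illegal move still counts as a mistake
        raise ValueError(f"Illegal move from LLM: {reply}")
    return move
```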

But can LLMs make meaningful moves and win? Let's test them against a Random Player (a bot that picks a legal move at random). These foundation models, with their vast knowledge and reasoning abilities, should easily defeat a chaos monkey, right?
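For reference, the "chaos monkey" opponent and a game loop can be sketched in a few lines (again assuming python-chess; the real harness's move limit and draw handling may differ):

```python
import random
import chess

def random_move(board: chess.Board) -> chess.Move:
    # The Random Player: pick uniformly among the currently legal moves.
    return random.choice(list(board.legal_moves))

def play_game(black_move_fn, max_moves: int = 200) -> str:
    board = chess.Board()
    while not board.is_game_over() and board.fullmove_number <= max_moves:
        move = random_move(board) if board.turn == chess.WHITE else black_move_fn(board)
        board.push(move)
    return board.result(claim_draw=True)  # "1-0", "0-1", "1/2-1/2", or "*" if cut off
```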

Let's find out ツ

#   Player   Wins   Draws   Mistakes   Tokens  

METRICS:

- Player: Model playing as black against a Random Player.
- Wins: How often the player scored a win (by checkmate or because the opponent failed to make a move). Reflects an LLM's chess proficiency.
- Draws: Percentage of games without a winner (e.g., reaching the maximum of 200 moves, or a stalemate). Reflects a weaker LLM's chess proficiency when it cannot win.
- Mistakes: Number of erroneous LLM replies per 1000 moves, i.e., how often the LLM failed to follow the instructions and make a move, e.g., due to hallucinations, picking illegal moves, or not conforming to the communication protocol. Shows the model's instruction-following capabilities and hallucination resistance (see the sketch after this list).
- Tokens: Number of tokens generated per move. Demonstrates the model's verbosity.
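The last two columns are simple ratios; the helpers below illustrate the arithmetic (function and variable names are hypothetical, not the project's actual code):

```python
def mistakes_per_1000_moves(erroneous_replies: int, llm_moves: int) -> float:
    # E.g. 12 bad replies over 3000 LLM moves -> 4.0 mistakes per 1000 moves.
    return 1000 * erroneous_replies / llm_moves

def tokens_per_move(total_completion_tokens: int, llm_moves: int) -> float:
    # Average verbosity: completion tokens divided by moves actually made.
    return total_completion_tokens / llm_moves
```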

NOTES:

- LLMs played as black against a Random Player (as white).
- 30+ game simulations for Random Player vs. LLM.
- Bottom rows in green demonstrate how a Chess Engine (Stockfish v17) fares against a Random Player.
- 1000 simulations for Random Player vs. Chess Engine and Random vs. Random.
- You read that right: LLMs initially scored 0 wins. That is no longer the case, with o1-mini being the first LLM to score wins.
- Draws are used instead of Wins to evaluate the chess proficiency of LLMs that cannot win.
- The default sorting is by Wins DESC, Draws DESC, and Mistakes ASC (see the sketch after this list).
- Strong models (the ones that win) are judged on chess proficiency by % Wins; weak ones by % Draws.
- The Mistakes metric evaluates LLMs' instruction-following capabilities and resistance to hallucinations (making up illegal moves despite a list of legal moves being provided in the prompt).
- Sort by the Mistakes column to get a ranking by instruction-following ability (models with the fewest mistakes being better).
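A sketch of that default ordering (the row dictionaries and keys are hypothetical, for illustration only):

```python
rows = [
    {"player": "model-a", "wins": 0.0, "draws": 42.5, "mistakes": 3.1},
    {"player": "model-b", "wins": 6.7, "draws": 30.0, "mistakes": 1.2},
]
# Wins DESC, then Draws DESC, then Mistakes ASC.
rows.sort(key=lambda r: (-r["wins"], -r["draws"], r["mistakes"]))
```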

Project's GitHub