LLM Chess Leaderboard
Simulating chess games between a Random Player and an LLM, evaluating chat models' (1) chess proficiency and (2) instruction-following abilities.

 __       __                    ____     __                               
/\ \     /\ \       /'\_/`\    /\  _``. /\ \                              
\ \ \    \ \ \     /\      \   \ \ \/\_\\ \ \___      __    ____    ____  
 \ \ \  __\ \ \  __\ \ \__\ \   \ \ \/_/_\ \  _ `\  /'__`\ /',__\  /',__\ 
  \ \ \L\ \\ \ \L\ \\ \ \_/\ \   \ \ \L\ \\ \ \ \ \/\  __//\__, `\/\__, `\
   \ \____/ \ \____/ \ \_\\ \_\   \ \____/ \ \_\ \_\ \____\/\____/\/\____/
    \/___/   \/___/   \/_/ \/_/    \/___/   \/_/\/_/\/____/\/___/  \/___/ 

        
         __         __         __    __                        
        /\ \       /\ \       /\ "-./  \                       
        \ \ \____  \ \ \____  \ \ \-./\ \                      
         \ \_____\  \ \_____\  \ \_\ \ \_\                     
          \/_____/   \/_____/   \/_/  \/_/                                                                                
 ______     __  __     ______     ______     ______    
/\  ___\   /\ \_\ \   /\  ___\   /\  ___\   /\  ___\   
\ \ \____  \ \  __ \  \ \  __\   \ \___  \  \ \___  \  
 \ \_____\  \ \_\ \_\  \ \_____\  \/\_____\  \/\_____\ 
  \/_____/   \/_/\/_/   \/_____/   \/_____/   \/_____/ 
        
Random Player (White)
♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖

GPT-4o Mini (Black)

GAME OVER

- Outcome: Draw
- Max moves reached: 200
- Material White: 16
- Material Black: 18

Can Large Language Models play chess? Let's find out ツ

This leaderboard evaluates chess skill and instruction following in an agentic setting: LLMs engage in multi-turn dialogs where they are presented with a choice of actions (e.g., "get board" or "make move") when playing against an opponent (Random Player or Chess Engine).
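
As a rough sketch (not the project's actual code), one such turn might look like the loop below. The `ask_llm` callable, the exact action names, and the retry budget are illustrative assumptions; board handling leans on the python-chess library:

```python
import chess  # python-chess, used here for board state and move legality

def play_llm_turn(board: chess.Board, ask_llm, max_actions: int = 3) -> bool:
    """One agentic turn for the LLM. `ask_llm` is a hypothetical callable
    that sends one dialog message and returns the model's raw reply.
    Returns True if a legal move was played, False on a protocol failure."""
    prompt = ("Choose an action: 'get_current_board', "
              "'get_legal_moves', or 'make_move <move>'.")
    for _ in range(max_actions):
        reply = ask_llm(prompt).strip()
        if reply.startswith("get_current_board"):
            prompt = str(board)  # show the position, then re-ask
        elif reply.startswith("get_legal_moves"):
            prompt = ",".join(m.uci() for m in board.legal_moves)
        elif reply.startswith("make_move"):
            try:
                board.push_uci(reply.split(maxsplit=1)[1])
                return True   # legal move applied
            except (IndexError, ValueError):
                return False  # bad syntax or illegal move
        else:
            return False      # hallucinated action
    return False              # never committed to a move
```

Anything outside this protocol (a hallucinated action or an illegal move) ends the turn as an instruction-following failure.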

In 2024, we began with a chaos monkey baseline: a Random Player that chooses legal moves at random. At the time, most models could barely compete, either losing outright because they failed to follow game instructions (hallucinating illegal moves or taking incorrect actions) or dragging the game to the 200-move limit (a draw) because they couldn't win.

In 2025, capable reasoning models mastered both instruction following and chess skill, so we added the Komodo Dragon Chess Engine as a stronger opponent. Dragon's skill levels are Elo-rated on chess.com, which lets us anchor the results to a real-world rating scale and compute an Elo rating for each model.

METRICS:

- Player: Model name (playing as Black). Models that also played vs Dragon are marked with an asterisk in superscript (e.g., 3*).
- Elo: Estimated Elo anchored by Dragon skill levels and a calibrated Random Player. We solve a one-dimensional MLE over aggregated blocks (opponent Elo, wins, draws, losses) and report a ±95% CI; see the sketch after this list. When both Random and Dragon data exist, they are combined. Elo is left empty for extreme 100% win/loss records or when there are no anchored games.
- Win/Loss: Wins minus losses, rescaled to 0-100% of total games so that 50% ≈ equal wins and losses; reflects BOTH chess skill and instruction following.
- Game Duration: Share of maximum game length completed (0-100%); measures instruction-following stability across many moves.
- Tokens: Completion tokens per move; verbosity/efficiency signal.
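
For intuition, here is a minimal, hedged sketch of that one-dimensional MLE (not the leaderboard's actual code). It scores draws as half a win, uses the standard logistic Elo expectation, and fits the rating over made-up example blocks:

```python
import math
from scipy.optimize import minimize_scalar  # SciPy assumed available

def expected_score(r_player: float, r_opponent: float) -> float:
    """Standard logistic Elo expectation: P(win) + 0.5 * P(draw)."""
    return 1.0 / (1.0 + 10 ** ((r_opponent - r_player) / 400.0))

def neg_log_likelihood(r: float, blocks) -> float:
    """blocks: iterable of (opponent_elo, wins, draws, losses) tuples.
    Draws count as half a win, a common simplification."""
    nll = 0.0
    for opp_elo, wins, draws, losses in blocks:
        p = min(max(expected_score(r, opp_elo), 1e-9), 1.0 - 1e-9)
        score = wins + 0.5 * draws  # total points scored in this block
        n = wins + draws + losses   # games in this block
        nll -= score * math.log(p) + (n - score) * math.log(1.0 - p)
    return nll

# Made-up blocks purely for illustration: (opponent Elo, W, D, L).
blocks = [(1400, 5, 3, 12), (1800, 1, 2, 17)]
fit = minimize_scalar(neg_log_likelihood, args=(blocks,),
                      bounds=(0, 3500), method="bounded")
print(f"Estimated Elo: {fit.x:.0f}")
```

Because the expected score is monotone in the rating, a 100% win (or loss) record pushes the optimum to the search boundary rather than a finite maximum, which is why such rows show an empty Elo; a ±95% CI can be derived from the curvature of this same likelihood.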

ARRANGEMENT & SOURCES:

- Primary sorting: Elo (DESC), then Win/Loss (DESC), Game Duration (DESC), Tokens (ASC); see the sort-key sketch after this list.
- Data sources mix Random-vs-LLM and Dragon-vs-LLM games. Dragon levels map to Elo and provide the anchor; Random is first calibrated vs Dragon and then used as an opponent for many models.
- Elo ratings are not comparable across player pools, i.e., you cannot compare chess.com Elo to FIDE Elo.
- Chess.com references used for context (as of Sep 2025): Rapid Leaderboard (Elo pool), Magnus Carlsen stats, and Elo explanation & player classes.
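
In code, the primary sorting above reduces to a single composite key; the row fields below are illustrative, not the leaderboard's actual schema:

```python
# Illustrative rows; ties on Elo fall through to Win/Loss, Duration, Tokens.
rows = [
    {"player": "model-a", "elo": 1350, "win_loss": 62.0, "duration": 98.0, "tokens": 410},
    {"player": "model-b", "elo": 1350, "win_loss": 71.5, "duration": 99.2, "tokens": 380},
]
# Negate the keys that sort DESC; Tokens stays positive for ASC.
rows.sort(key=lambda r: (-r["elo"], -r["win_loss"], -r["duration"], r["tokens"]))
```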

Project's GitHub