LLM CHESS
♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖
GAME OVER
- Outcome: Draw
Can Large Language Models play chess? Let's find out ツ
This leaderboard evaluates chess skill and instruction following
in an agentic setting:
LLMs engage in multi-turn dialogs where they are presented with a choice of actions (e.g., "get board" or "make move") when playing against
an opponent (Random Player or Chess Engine).
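The dialog loop can be pictured with a minimal sketch. The action names and helper functions below are illustrative assumptions, not the project's exact interface:

```python
import random

# Hypothetical action set: each turn, the model replies with one of these.
ACTIONS = ["get_current_board", "get_legal_moves", "make_move <uci>"]

def play_turn(model_reply, board_fen, legal_moves):
    """Dispatch a single model reply to the matching action (illustrative only)."""
    if model_reply == "get_current_board":
        return board_fen
    if model_reply == "get_legal_moves":
        return ",".join(legal_moves)
    if model_reply.startswith("make_move "):
        move = model_reply.split(" ", 1)[1]
        if move in legal_moves:
            return f"played {move}"
        return "illegal move"       # counted as an instruction-following failure
    return "unknown action"         # hallucinated action

def random_player(legal_moves):
    """The chaos-monkey baseline: pick any legal move, uniformly at random."""
    return "make_move " + random.choice(legal_moves)
```

A real harness would track the game state between turns; this sketch only shows the request/response shape of one turn.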
In 2024, we began with a chaos monkey baseline — a Random Player that chooses legal moves at random.
At the time, most models could barely compete and lost either because they failed to follow game instructions
(e.g., hallucinating illegal moves or taking incorrect actions) or because they dragged the game to the 200-move limit
without being able to win.
In 2025, more capable reasoning models nailed both instruction following and chess skill.
We've added the Komodo Dragon Chess Engine as a more capable opponent; its skill levels are Elo-rated on
chess.com. This allows us to anchor the results to a real-world rating scale and compute an Elo rating for each model.
METRICS:
- Player: Model name (playing as Black). Models that also played vs Dragon are marked with an asterisk in superscript (e.g., 3*).
- Elo: Estimated Elo anchored by Dragon skill levels and calibrated Random. We solve a 1D MLE over aggregated blocks (opponent Elo, wins, draws, losses) and report ±95% CI. When both Random and Dragon data exist, they are combined. An empty Elo cell appears for extreme results (100% wins or losses) or when a model has no anchored games.
- Win/Loss: Difference between wins and losses, rescaled to a 0-100% range over total games; reflects BOTH chess skill and instruction following. 50% ≈ equal wins and losses; above 50% means more wins than losses.
- Game Duration: Share of maximum game length completed (0-100%); measures instruction-following stability across many moves.
- Tokens: Completion tokens per move; verbosity/efficiency signal.
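The 1D Elo MLE can be sketched as follows. This is an illustrative reconstruction, not the leaderboard's exact code: it uses the standard logistic expected-score model E = 1 / (1 + 10^((opp − r) / 400)), and treating each draw as half a win plus half a loss is our assumption.

```python
import math

def expected_score(r, opp):
    """Logistic expected score of a player rated r against an opponent rated opp."""
    return 1.0 / (1.0 + 10.0 ** ((opp - r) / 400.0))

def log_likelihood(r, blocks):
    """Log-likelihood of rating r over aggregated (opp_elo, wins, draws, losses) blocks.
    Draws are split half/half between the win and loss terms (our assumption)."""
    ll = 0.0
    for opp, wins, draws, losses in blocks:
        p = expected_score(r, opp)
        ll += (wins + 0.5 * draws) * math.log(p)
        ll += (losses + 0.5 * draws) * math.log(1.0 - p)
    return ll

def fit_elo(blocks, lo=0.0, hi=4000.0, iters=100):
    """Ternary search for the maximizer: the log-likelihood is unimodal in r."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_likelihood(m1, blocks) < log_likelihood(m2, blocks):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2
```

With an even record against a 1500-rated opponent, the estimate lands at ≈1500; winning 15 of 20 games pushes it roughly 190 points above the opponent. Confidence intervals (the ±95% CI above) would come from the curvature of this likelihood and are omitted here.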
ARRANGEMENT & SOURCES:
- Primary sorting: Elo (DESC), then Win/Loss (DESC), Game Duration (DESC), Tokens (ASC).
- Data sources mix Random-vs-LLM and Dragon-vs-LLM games. Dragon levels map to Elo and provide the anchor; Random is first calibrated vs Dragon and then used as an opponent for many models.
- Elo ratings are not comparable across player pools, e.g., you cannot directly compare chess.com Elo to FIDE Elo.
- Chess.com references used for context (as of Sep 2025): Rapid Leaderboard (Elo pool), Magnus Carlsen stats, and Elo explanation & player classes.
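The primary sorting above (Elo DESC, Win/Loss DESC, Game Duration DESC, Tokens ASC) can be expressed as a single multi-key sort; the field names and sample rows here are hypothetical:

```python
rows = [
    {"player": "A", "elo": 1800, "win_loss": 60, "duration": 90, "tokens": 500},
    {"player": "B", "elo": 1800, "win_loss": 60, "duration": 90, "tokens": 300},
    {"player": "C", "elo": 2100, "win_loss": 70, "duration": 95, "tokens": 800},
]

# Negate the descending keys; Tokens stays positive because it sorts ascending.
ranked = sorted(rows, key=lambda r: (-r["elo"], -r["win_loss"],
                                     -r["duration"], r["tokens"]))
# C ranks first on Elo; B edges out A on the final tiebreaker (fewer tokens).
```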