LLM CHESS
♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖
GAME OVER
- Outcome: Draw
Can Large Language Models play chess? Let's find out ツ
This leaderboard evaluates chess skill and instruction following in an agentic setting:
LLMs engage in multi-turn dialogs where they are presented with a choice of actions
(e.g., "get board" or "make move") while playing against an opponent (Random Player or Chess Engine).
In 2024, we began with a chaos-monkey baseline: a Random Player that chooses legal moves at random.
At the time, most models could barely compete; they lost either through an inability to follow game
instructions (e.g., hallucinating illegal moves or taking incorrect actions) or by dragging the game
to the 200-move limit because they couldn't win.
In 2025, more capable reasoning models nailed both instruction following and chess skill.
We've added the Komodo Dragon Chess Engine as a more capable opponent, which is also Elo-rated
on chess.com. This allowed us to anchor the results to a real-world rating scale and compute
an Elo rating for each model.
METRICS:
- Player: Model name (playing as Black). Models that also played vs Dragon are marked with a superscript asterisk (e.g., 3*).
- Elo: Estimated Elo anchored by Dragon skill levels and the calibrated Random Player. We solve a 1D MLE over aggregated blocks (opponent Elo, wins, draws, losses) and report a ±95% CI. When both Random and Dragon data exist, they are combined. Elo is left empty for extreme 100% win/loss records or when no anchored games exist.
- Win/Loss: Difference between wins and losses as a percentage of total games (0-100%); reflects BOTH chess skill and instruction following. 50% ≈ equal wins and losses.
- Game Duration: Share of the maximum game length completed (0-100%); measures instruction-following stability across many moves.
- Tokens: Completion tokens per move; a verbosity/efficiency signal.
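The 1D MLE behind the Elo metric can be sketched with a simple grid search. This is a minimal
illustration, not the leaderboard's actual code: `elo_mle` is a hypothetical helper, and it assumes
draws are scored as half a win plus half a loss (the CI computation is omitted).

```python
import math

def expected_score(r, r_opp):
    # Standard logistic Elo expectation for the rated player.
    return 1.0 / (1.0 + 10 ** ((r_opp - r) / 400.0))

def elo_mle(blocks, lo=0, hi=3500):
    """Grid-search MLE of a single rating from aggregated
    (opponent_elo, wins, draws, losses) blocks."""
    best_r, best_ll = lo, -math.inf
    for r in range(lo, hi + 1):
        ll = 0.0
        for r_opp, w, d, l in blocks:
            e = expected_score(r, r_opp)
            # Draws contribute half a win and half a loss to the log-likelihood.
            ll += (w + 0.5 * d) * math.log(e) + (l + 0.5 * d) * math.log(1.0 - e)
        if ll > best_ll:
            best_r, best_ll = r, ll
    return best_r

# Example: 10 wins, 5 draws, 5 losses vs a 1500-rated opponent.
print(elo_mle([(1500, 10, 5, 5)]))
```

Note that the likelihood has no finite maximum for an all-win or all-loss record, which is why the
leaderboard leaves Elo empty in those extreme cases.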
ARRANGEMENT & SOURCES:
- Primary sorting: Elo (DESC), then Win/Loss (DESC), Game Duration (DESC), Tokens (ASC).
- Data sources mix Random-vs-LLM and Dragon-vs-LLM games. Dragon levels map to Elo and provide the anchor; Random is first calibrated vs Dragon and then used as an opponent for many models.
- Elo ratings are not comparable across player pools, i.e., you cannot compare chess.com Elo to FIDE Elo.
- Chess.com references used for context (as of Sep 2025): Rapid Leaderboard (Elo pool), Magnus Carlsen stats, and Elo explanation & player classes.
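The primary sorting rule above can be sketched as a single multi-key sort: negating the numeric keys
yields descending order, while Tokens stays ascending. The row tuples and field order here are
hypothetical, for illustration only.

```python
# Hypothetical rows: (player, elo, win_loss_pct, game_duration_pct, tokens_per_move).
rows = [
    ("model-a", 1450, 62.0, 98.0, 850),
    ("model-b", 1450, 62.0, 98.0, 620),
    ("model-c", 1600, 55.0, 99.5, 1200),
]

# Elo, Win/Loss, and Game Duration descending; Tokens ascending breaks the tie.
rows.sort(key=lambda r: (-r[1], -r[2], -r[3], r[4]))
print([r[0] for r in rows])  # → ['model-c', 'model-b', 'model-a']
```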