LLM CHESS
♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖
GAME OVER
- Outcome: Draw
Can Large Language Models play chess effectively? This leaderboard evaluates both chess skill and instruction following ability.
Prompt an LLM to move based on the board state,
give it the legal moves as a hint, and it can make moves
(though some models struggle with instruction following)
and even explain its tactics or strategy.
But can LLMs make meaningful moves and win?
Let's test them against a Random Player (a bot that picks legal moves at random);
a minimal sketch of such a player follows below.
These foundation models, with their vast knowledge and reasoning abilities,
should easily defeat a chaos monkey, right?
Let's find out ツ
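For reference, here is a minimal sketch of such a Random Player, assuming the games are driven with the python-chess library (the actual harness may differ):

```python
import random
import chess

def random_player_move(board: chess.Board) -> chess.Move:
    # The "chaos monkey" baseline: pick any legal move, uniformly at random.
    return random.choice(list(board.legal_moves))

# Demo: two Random Players against each other until the game ends
# (python-chess ends the game automatically via the 75-move and
# fivefold-repetition rules, so this loop always terminates).
board = chess.Board()
while not board.is_game_over():
    board.push(random_player_move(board))
print(board.result())  # e.g. "1/2-1/2" for a draw
```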
METRICS:
- Player: Model playing as black against a Random Player.
- Win/Loss: Difference between wins and losses as a percentage of total games, scaled to 0-100% so that 50% represents an equal number of wins and losses and higher scores mean more wins than losses. This is the primary ranking metric: it measures BOTH chess skill AND instruction following ability, since a model needs to understand chess strategy AND follow the game instructions correctly to score well (a computation sketch follows this list).
- Game Duration: Percentage of maximum possible game length completed before termination (0-100%). This specifically measures instruction following reliability across many moves. 100% indicates the model either reached a natural conclusion (checkmate, stalemate) or the maximum 200 moves without protocol violations. Lower scores show the model struggled to maintain correct communication as the game progressed.
- Tokens: Number of tokens generated per move, a measure of the model's verbosity. Lower token counts may indicate efficiency, while higher counts may reflect more detailed reasoning OR more garbage generation (interpret this alongside the overall rank: reasoning models generate more tokens and score better, while weak models can be verbose yet still perform poorly).
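For concreteness, here is a minimal sketch of how these two metrics could be computed from raw game records. The formulas and field names (`wins`, `losses`, `games`, `moves_completed`, `max_moves`) are illustrative assumptions inferred from the definitions above, not the project's actual schema:

```python
def win_loss_score(wins: int, losses: int, games: int) -> float:
    # Assumed formula: the win/loss difference as a share of total games,
    # mapped onto 0-100% so that 50% means wins == losses.
    return 50.0 + 50.0 * (wins - losses) / games

def game_duration_score(moves_completed: int, max_moves: int = 200) -> float:
    # Share of the maximum possible game length reached before termination.
    # 100% means a natural conclusion or the full 200 moves without violations.
    return 100.0 * min(moves_completed, max_moves) / max_moves

# Example: 12 wins, 5 losses, 13 draws over 30 games
print(win_loss_score(12, 5, 30))   # ~61.7 -> more wins than losses
print(game_duration_score(137))    # 68.5  -> game terminated early
```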
NOTES:
- LLMs played as black against a Random Player (as white).
- 30+ game simulations for Random Player vs. LLM.
- Bottom rows in green demonstrate how a Chess Engine (Stockfish v17) fares against a Random Player.
- 1000 simulations for Random Player vs. Chess Engine and Random vs. Random.
- The default sorting is by Win/Loss (DESC), Game Duration (DESC), and Mistakes (ASC); a sketch of this sort is shown below.
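A minimal sketch of that default sort, assuming the leaderboard is held in a pandas DataFrame; the column names here are hypothetical, not the project's actual schema:

```python
import pandas as pd

# Hypothetical rows and column names for illustration only.
leaderboard = pd.DataFrame(
    [
        {"player": "model-a", "win_loss": 55.0, "game_duration": 98.0, "mistakes": 2},
        {"player": "model-b", "win_loss": 55.0, "game_duration": 91.0, "mistakes": 5},
        {"player": "model-c", "win_loss": 48.0, "game_duration": 99.0, "mistakes": 1},
    ]
)

ranked = leaderboard.sort_values(
    by=["win_loss", "game_duration", "mistakes"],
    ascending=[False, False, True],  # Win/Loss DESC, Game Duration DESC, Mistakes ASC
)
print(ranked)
```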