LLM CHESS
[Demo: Random Player (White) vs. GPT-4o Mini (Black). Sample game report: Outcome: Draw (max moves reached: 200), material White: 16, Black: 18.]
Can Large Language Models play chess effectively? This leaderboard evaluates both chess skill and instruction following ability.
Prompt an LLM with the current board state, hint it with the legal moves, and it can make a move (though some models struggle to follow the instructions) and even explain its tactics or strategy.
But can LLMs make meaningful moves and win?
Let's test them against a random player (a bot that picks legal moves randomly).
These foundation models, with their vast knowledge and reasoning abilities, should easily defeat a chaos monkey, right?
Let's find out ツ
METRICS:
- Player: Model playing as black against a Random Player.
- Win/Loss: Difference between wins and losses, normalized to the total number of games (0-100%). This is the primary ranking metric and it measures BOTH chess skill AND instruction following ability: a model needs to understand chess strategy AND follow the game instructions correctly to score well. 50% represents an equal number of wins and losses; higher scores mean more wins than losses.
- Game Duration: Percentage of the maximum possible game length completed before termination (0-100%). This specifically measures instruction following reliability across many moves. 100% indicates the model either reached a natural conclusion (checkmate, stalemate) or the maximum of 200 moves without protocol violations. Lower scores show the model failed to maintain correct communication as the game progressed.
- Tokens: Number of tokens generated per move, a measure of the model's verbosity. Lower counts may indicate efficiency, while higher counts may reflect more detailed reasoning OR more garbage generation (judge by the overall rank: reasoning models generate more tokens and score better, while weak models can be just as verbose yet perform poorly).
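As a back-of-the-envelope sketch, the two scores can be computed like this (illustrative Python with an assumed normalization matching the descriptions above; the project's actual code may differ):

# Sketch of the metric calculations (illustrative names, not the project's code).
MAX_MOVES = 200

def win_loss_score(wins: int, losses: int, games: int) -> float:
    # 50% = as many wins as losses; 100% = all wins; 0% = all losses.
    return 50.0 + 50.0 * (wins - losses) / games

def game_duration(moves_made: int, natural_end: bool) -> float:
    # 100% for checkmate/stalemate or for reaching the move cap;
    # otherwise the fraction of the cap completed before termination.
    return 100.0 if natural_end else 100.0 * moves_made / MAX_MOVES

print(win_loss_score(wins=12, losses=3, games=30))       # 65.0
print(game_duration(moves_made=150, natural_end=False))  # 75.0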
NOTES:
- LLMs played as black against a Random Player (as white).
- 30+ game simulations for Random Player vs. LLM.
- Bottom rows in green demonstrate how a Chess Engine (Stockfish v17) fares against a Random Player.
- 1000 simulations for Random Player vs. Chess Engine and Random vs. Random.
- The default sorting is by Win/Loss (DESC), Game Duration (DESC) and Mistakes (ASC).
MATRIX VISUALIZATION:
This plot positions the LLM chess players along two key metrics:
- X-Axis: Game Duration (0-100%) - Shows how well models maintain correct communication protocols throughout the game. Higher values indicate better instruction following ability.
- Y-Axis: Win Rate (0-100%) - This metric is less strict than the Win/Loss (Non-Interrupted) metric used in the leaderboard, as it ignores technical losses caused by poor instruction following. Higher values indicate better chess strategy and decision making.
INTERPRETATION:
- Top-Right: Models with both excellent chess skill and instruction following.
- Top-Left: Models with good chess skill that struggle to maintain the communication protocol.
- Bottom-Right: Models that follow instructions well but make poor chess moves.
- Bottom-Left: Models that struggle with both chess strategy and following instructions.
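A minimal matplotlib sketch of such a scatter plot (the model names and data points below are invented purely for illustration):

import matplotlib.pyplot as plt

# Hypothetical data: model -> (game duration %, win rate %).
models = {
    "model_a": (95, 70),   # top-right: strong at both
    "model_b": (35, 60),   # top-left: decent chess, poor protocol
    "model_c": (90, 10),   # bottom-right: obedient but weak chess
}

fig, ax = plt.subplots()
for name, (duration, win_rate) in models.items():
    ax.scatter(duration, win_rate)
    ax.annotate(name, (duration, win_rate))
ax.set_xlabel("Game Duration, % (instruction following)")
ax.set_ylabel("Win Rate, % (chess skill)")
ax.set_xlim(0, 100)
ax.set_ylim(0, 100)
plt.show()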
Libraries and Dependencies Used:
- chess: a Python library that handles chess rules and basic operations, including board representation, legal move generation, and game state evaluation. It is not a chess engine and does not calculate best moves itself.
- Microsoft Autogen: the backbone for LLM communication. It also implements the interaction between the chess board and custom agents such as GameAgent, RandomPlayerAgent, and AutoReplyAgent that simulate the different player types.
- Stockfish: the chess engine that does calculate best moves, used as a reference to demonstrate how a real chess player performs against the Random Player.
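For reference, the basic operations the harness needs from the chess library look like this (a minimal sketch using python-chess's public API):

import chess

board = chess.Board()                                  # standard starting position
print(", ".join(m.uci() for m in board.legal_moves))   # legal moves in UCI notation
board.push_uci("e2e4")                                 # apply white's move
board.push_uci("e7e5")                                 # apply black's reply
print(board.unicode())                                 # render the board with chess glyphs
print(board.is_game_over())                            # checkmate/stalemate/draw detection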
Workflow:
1. The game is initialized with a chess board and two players: a Random Player (as white) and an LLM (as black).
2. The game loop runs until a termination condition is met, such as checkmate, stalemate, or reaching the maximum number of moves.
3. Each player takes turns making a move. The Random Player selects a move randomly from the list of legal moves.
4. The LLM is prompted to make a move using a structured dialog, which includes actions like getting the current board state, retrieving legal moves, and making a move.
5. The game state is updated after each move, and the board is visualized if enabled.
6. Game statistics are generated and stored at the end of the game.
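Condensed into code, the loop looks roughly like this (a sketch; llm_pick_move is a stand-in for the full LLM dialog described below, not the project's actual function):

import random
import chess

MAX_MOVES = 200  # the move cap described above

def play_game(llm_pick_move) -> chess.Board:
    board = chess.Board()
    for _ in range(MAX_MOVES):
        if board.is_game_over():  # checkmate, stalemate, etc.
            break
        if board.turn == chess.WHITE:
            move = random.choice(list(board.legal_moves))  # Random Player
        else:
            move = llm_pick_move(board)  # LLM-driven dialog picks the move
        board.push(move)
    return board  # final state, used to generate game statistics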
Dialog:
Here is an example of a dialog that prompts a model to make a move. UCI format refers to Universal Chess Interface notation, which uses source and destination squares (e.g., e2e4 means move from e2 to e4):
Proxy (to Player_Black):
You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of the following actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'make_move <UCI formatted move>' when you are ready to complete your turn (e.g., 'make_move e2e4')
Respond with the action.
Player_Black (to Proxy):
get_current_board
Proxy (to Player_Black):
♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖
Player_Black (to Proxy):
get_legal_moves
Proxy (to Player_Black):
a7a6, b7b6, c7c6, d7d6, e7e6, f7f6, g7g6, h7h6, a7a5, b7b5, c7c5, d7d5, e7e5, f7f5, g7g5, h7h5
Player_Black (to Proxy):
make_move e7e5
Proxy (to Player_Black):
Move made, switching player
1 move = 1 dialog. Each dialog is limited to 10 turns (10 pairs of Proxy/Agent request/response messages), and a maximum of 3 mistakes (not conforming to the communication notation, picking a wrong action, or making a wrong move) are allowed per dialog.
The game is terminated and the LLM is given a loss if either the turn limit or the mistake limit is reached within a dialog. Models that excel at both chess strategy and following instructions precisely will perform best.
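A sketch of how a proxy might enforce this protocol, including the 10-turn and 3-mistake limits (illustrative code built on a python-chess board; llm is any callable that maps the proxy's message to the model's reply, and the function name is hypothetical):

import chess

MAX_TURNS, MAX_MISTAKES = 10, 3

def run_dialog(board: chess.Board, llm) -> bool:
    # Returns True if the model completed its move within the limits.
    mistakes = 0
    message = "Now is your turn to make a move. Respond with the action."
    for _ in range(MAX_TURNS):
        action = llm(message).strip()
        if action == "get_current_board":
            message = board.unicode()
        elif action == "get_legal_moves":
            message = ", ".join(m.uci() for m in board.legal_moves)
        elif action.startswith("make_move "):
            try:
                board.push_uci(action.split(maxsplit=1)[1])
                return True  # "Move made, switching player"
            except ValueError:  # malformed or illegal move
                mistakes += 1
                message = "Invalid move."
        else:
            mistakes += 1
            message = "Invalid action."
        if mistakes >= MAX_MISTAKES:
            return False  # mistake limit hit: game over, LLM loses
    return False  # turn limit hit: game over, LLM loses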
Project's GitHub