LLM Chess Leaderboard
Simulating chess games between a Random Player and an LLM, evaluating chat models' (1) chess proficiency and (2) instruction-following abilities.
[ASCII "LLM Chess" banner and animated demo board from the project page: a simulated game ending in a draw]
Can a Large Language Model play chess? Prompt it to move based on the board state,
hint it with the legal moves, and it can make moves (though some models struggle to follow the instructions)
and even explain its tactics or strategy.
But can LLMs make meaningful moves and win?
Let’s test them against a random player (a bot that picks legal moves randomly).
These foundation models, with their vast knowledge and reasoning abilities,
should easily defeat a chaos monkey, right?
Let's find out ツ
# | Player | Wins | Draws | Mistakes | Tokens |
---|--------|------|-------|----------|--------|
(leaderboard rows omitted - they are generated dynamically on the project page)
METRICS:
- Player: Model playing as black against a Random Player.
- Wins: How often the player scored a win (by checkmate or the opponent failing to make a move). Shows the LLM's chess proficiency.
- Draws: Percentage of games without a winner (e.g., reaching the maximum of 200 moves or a stalemate). Shows a weaker LLM's chess proficiency when it can't win.
- Mistakes: Number of erroneous LLM replies per 1000 moves - how often the LLM failed to follow the instructions
and make a move (e.g., due to hallucinations, picking illegal moves, or not conforming to the communication protocol). Shows the model's instruction-following capabilities and hallucination resistance.
- Tokens: Number of tokens generated per move. Demonstrates the model's verbosity.
NOTES:
- LLMs played as black against a Random Player (as white).
- 30+ game simulations for Random Player vs. LLM.
- Bottom rows in green demonstrate how a Chess Engine (Stockfish v17) fares against a Random Player.
- 1000 simulations for Random Player vs. Chess Engine and Random vs. Random.
- You read that right - LLMs originally scored 0 wins. That is no longer the case: o1-mini was the first LLM to score wins.
- Draws were initially used instead of Wins to evaluate LLMs' chess proficiency.
- The default sorting is by Wins DESC, Draws DESC, and Mistakes ASC (a small sketch of the metric and ordering computation follows these notes).
- Strong models (the ones that win) are judged on chess proficiency by % Won; weak ones by % Draws.
- The Mistakes metric evaluates LLMs' instruction-following capabilities and resistance to hallucinations (making up non-legal moves while having a list of legal moves provided in the prompt).
- Sort by the Mistakes column to get a ranking by instruction-following ability (models with the fewest mistakes are better).
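For reference, here is a small sketch of how the metrics above and the default ordering could be computed from per-model aggregates. The field and function names are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ModelStats:
    name: str
    games: int
    wins: int          # games won by the LLM
    draws: int         # games without a winner
    llm_mistakes: int  # erroneous replies across all games
    llm_moves: int     # total moves made across all games
    tokens: int        # total tokens generated across all games

    @property
    def win_rate(self) -> float:
        return 100.0 * self.wins / self.games

    @property
    def draw_rate(self) -> float:
        return 100.0 * self.draws / self.games

    @property
    def mistakes_per_1000_moves(self) -> float:
        return 1000.0 * self.llm_mistakes / max(self.llm_moves, 1)

    @property
    def tokens_per_move(self) -> float:
        return self.tokens / max(self.llm_moves, 1)

# Default leaderboard order: Wins DESC, Draws DESC, Mistakes ASC
def leaderboard(rows: list[ModelStats]) -> list[ModelStats]:
    return sorted(rows, key=lambda r: (-r.win_rate, -r.draw_rate, r.mistakes_per_1000_moves))
```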
Libraries and Dependencies Used:
- chess: A Python library for handling chess rules and basic operations, including board representation, legal move generation, and game state evaluation. It is not a chess engine and does not calculate the best move.
- Microsoft Autogen: used as the backbone for LLM communication. It also implements the interaction between the chess board and custom agents (GameAgent, RandomPlayerAgent, AutoReplyAgent, and others) for simulating different player types.
- Stockfish: the chess engine that does the actual best-move calculation, used as a reference to demonstrate what a real chess player's performance looks like.
Workflow:
1. The game is initialized with a chess board and two players: a Random Player (as white) and an LLM (as black).
2. The game loop runs until a termination condition is met, such as checkmate, stalemate, or reaching the maximum number of moves.
3. Each player takes turns making a move. The Random Player selects a move randomly from the list of legal moves.
4. The LLM is prompted to make a move using a structured dialog, which includes actions like getting the current board state, retrieving legal moves, and making a move.
5. The game state is updated after each move, and the board is visualized if enabled.
6. Game statistics are generated and stored at the end of the game.
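A minimal sketch of this loop using python-chess; `llm_pick_move` is a hypothetical placeholder for the Autogen-driven dialog described below, and the 200-move cut-off is treated here as a simple ply counter:

```python
import random
import chess

MAX_MOVES = 200  # games that reach this limit are recorded as draws

def llm_pick_move(board: chess.Board) -> chess.Move:
    """Hypothetical placeholder: in the real setup this is the multi-turn
    Autogen dialog (get_current_board / get_legal_moves / make_move)."""
    raise NotImplementedError

def play_one_game() -> chess.Board:
    board = chess.Board()
    ply = 0
    while not board.is_game_over() and ply < MAX_MOVES:
        if board.turn == chess.WHITE:
            # Random Player: pick any legal move
            move = random.choice(list(board.legal_moves))
        else:
            # LLM plays black via the prompted dialog
            move = llm_pick_move(board)
        board.push(move)
        ply += 1
    return board  # outcome and statistics can be derived from the final board
```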
Dialog:
Here is an example of a dialog that prompts a model to make a move:
Proxy (to Player_Black):
You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of the following actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'make_move <UCI formatted move>' when you are ready to complete your turn (e.g., 'make_move e2e4')
Respond with the action.

Player_Black (to Proxy):
get_current_board

Proxy (to Player_Black):
♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖

Player_Black (to Proxy):
get_legal_moves

Proxy (to Player_Black):
a7a6, b7b6, c7c6, d7d6, e7e6, f7f6, g7g6, h7h6, a7a5, b7b5, c7c5, d7d5, e7e5, f7f5, g7g5, h7h5

Player_Black (to Proxy):
make_move e7e5

Proxy (to Player_Black):
Move made, switching player
1 move = 1 dialog. Each dialog is limited to 10 turns (10 pairs of Proxy/Agent request/response messages). A maximum of 3 mistakes (not conforming to the communication notation, picking a wrong action, or picking a wrong move) is allowed per dialog. The game is terminated and the LLM is given a loss if either the maximum number of turns or the maximum number of LLM mistakes is reached within a given dialog.
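To illustrate the protocol and the turn/mistake budget, here is a rough sketch of how one move's dialog could be driven. `ask_llm` is a hypothetical stand-in for the Autogen agent exchange, and the regex-based action extraction mirrors the tolerant protocol mentioned later in the notes; this is not the project's actual code:

```python
import re
import chess

MAX_DIALOG_TURNS = 10
MAX_MISTAKES_PER_DIALOG = 3

def run_move_dialog(board: chess.Board, ask_llm) -> chess.Move:
    """Drive one move dialog; raises if the LLM exhausts its turn/mistake budget."""
    mistakes = 0
    reply = ask_llm("Now is your turn to make a move. ...")  # turn prompt elided
    for _ in range(MAX_DIALOG_TURNS):
        if "get_current_board" in reply:
            reply = ask_llm(board.unicode())
        elif "get_legal_moves" in reply:
            reply = ask_llm(", ".join(m.uci() for m in board.legal_moves))
        elif (match := re.search(r"make_move\s+([a-h][1-8][a-h][1-8][qrbn]?)", reply)):
            move = chess.Move.from_uci(match.group(1))
            if move in board.legal_moves:
                return move
            mistakes += 1
            reply = ask_llm(f"Failed to make move: illegal uci: '{match.group(1)}'")
        else:
            mistakes += 1
            reply = ask_llm("Invalid action. Pick one of: get_current_board, "
                            "get_legal_moves, make_move <uci>.")
        if mistakes >= MAX_MISTAKES_PER_DIALOG:
            break
    raise RuntimeError("Too many wrong actions or dialog turns - the LLM forfeits the game")
```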
-------------------
February 12, 2025: Legacy GPT-4, get_current_board vs. get_legal_moves
-------------------
The results for gpt-4-0613 and gpt-4-32k-0613 show they performed well, making zero mistakes and achieving numerous draws.
However, these older GPT-4 models completely ignored the opportunity to request the board state (the get_current_board action) and instead relied solely
on the list of legal moves (the get_legal_moves action), making their moves immediately - mimicking the random bot by mechanistically fetching the list of moves and mechanistically picking one.
Upon closer examination of the outputs of a few models, it was observed that reasoning models (R1, o1/o3) predominantly used get_current_board, demonstrating confidence in their ability
to decide on a legal move without needing additional hints (i.e., without requesting get_legal_moves). For o3 models, this confidence increased with higher reasoning effort.
The majority of "traditional" LLMs (GPT-4o, Claude 3.5 Sonnet) used both get_current_board and get_legal_moves before making a move.
Here are the ratios of actions issued by the models (get_current_board/get_legal_moves):
llama-3.1-tulu-3-8b@q8_0: As is often the case with fine-tunes, Tulu performed worse than the original model. It overwhelmed its responses with verbosity,
generating 1996.33 tokens per 1000 moves compared to just 162.1 tokens for the original llama3.1-8b model. An example dialog illustrating this behavior:

Player_Black (to Proxy): Thank you for providing the current board status. Now, I'll check the legal moves available to me. Please provide the result of 'get_legal_moves'.
--------------------------------------------------------------------------------
Proxy (to Player_Black): h8g8,f8e7,f8d6,f8c5,f8b4,f8a3,e8d8,e8e7,e8d7,a8d8,a8c8,a8b8,e5e7,e5f6,e5e6,e5d6,e5g5,e5f5,e5d5,e5c5,e5b5,e5a5,e5f4,e5e4,e5d4,e5g3,e5c3,e5h2,e5b2,e5a1,h7g6,h7h6,c7c6,b7b6,a7a6,h7h5,c7c5,b7b5,a7a5
--------------------------------------------------------------------------------
Player_Black (to Proxy): After reviewing the current board and the list of legal moves, I will make my move. I choose to move the knight from e5 to f3, attacking the king and potentially setting up a fork. Please execute 'make_move e5f3'.
--------------------------------------------------------------------------------
Proxy (to Player_Black): Failed to make move: illegal uci: 'e5f3' in r3kb1r/ppp3pp/6P1/4q1N1/P3N3/1P1p3P/R1pP1P2/2B2K1R b - - 0 26
--------------------------------------------------------------------------------
Player_Black (to Proxy): Apologies for the mistake. Let me correct my move. I will move the knight from e5 to d3, which is a legal move and aims to challenge the opponent's control of the center. Please execute 'make_move e5d3'.
--------------------------------------------------------------------------------
Proxy (to Player_Black): Failed to make move: illegal uci: 'e5d3' in r3kb1r/ppp3pp/6P1/4q1N1/P3N3/1P1p3P/R1pP1P2/2B2K1R b - - 0 26
--------------------------------------------------------------------------------
Player_Black (to Proxy): I apologize for the confusion. Let me review the legal moves again and choose a different move. I will move the queen from e5 to e2, which is a legal move and aims to put pressure on the opponent's position. Please execute 'make_move e5e2'.
--------------------------------------------------------------------------------
BREAKING >>> Failed to make move: illegal uci: 'e5e2' in r3kb1r/ppp3pp/6P1/4q1N1/P3N3/1P1p3P/R1pP1P2/2B2K1R b - - 0 26
Proxy (to Player_Black): Too many wrong actions
-------------------
January 22, 2025: R1 Challenges o1, thinking "surrogates"
-------------------
Released on January 21, DeepSeek-R1 made a significant impact by showing similar or even better performance than OpenAI's o1 models across a range of math and
coding benchmarks. All of this comes at a fraction of the cost when accessing their 671B model through their API. Besides, the model weights are open, and
there are plenty of details about the training process - a stark contrast to OpenAI's secrecy regarding the o1/o3 family of models. And that's not all! For
those interested in local models, there are smaller "distilled" reasoning models at 70B, 32B, 14B, etc. sizes.
So far, none of the so-called reasoning or thinking models (qwq-32b-preview, sky-t1-32b-preview, gemini-2.0-flash-thinking-exp-1219) had come close to o1.
They were a step back in instruction following and could hardly be steered compared to non-thinking models, and they didn't perform as well in chess either.
I call them surrogates: verbosity generators mimicking but not replicating o1's behavior.
DeepSeek-R1 (deepseek-reasoner-r1) was an extremely curious model to test, and now there are evaluation results in the Leaderboard.
Additionally, Google has just released an update to its Gemini 2.0 Thinking model (gemini-2.0-flash-thinking-exp-01-21), which previously showed poor results. Two
very interesting releases in two days...
NOTES:
- There are now three models in a league of their own: o1-mini, o1-preview, and now deepseek-reasoner-r1. These are the only models with significant wins (>10%)
that also maintain a positive material difference (~9), as opposed to the negative material difference seen for the rest of the models.
- deepseek-reasoner-r1 is much more verbose (almost twice as many tokens as o1) and also makes more mistakes. Many games were lost due to the model making mistakes,
unlike the o1 models, which mostly dropped out of games by losing or drawing (e.g., stalemate, insufficient material).
- deepseek-reasoner-r1 showed distinctive behavior: sometimes it made moves without requesting the board state or the legal moves list (i.e., issuing the make_move
action in its very first response); sometimes it only requested the board state, skipped the list of moves, and decided on a move without the hint (in some cases
picking a legal move, in others not).
- Overall, the o1 models are better all-round models than R1, with better chess performance, instruction following (Mistakes), and durability (not falling out of
games, making more moves on average).
- The rest of the "thinking" models (sky-t1-32b-preview@q4_0, qwq-32b-preview@q4_k_m, deepseek-r1-distill-qwen-32b@q4_k_m, deepseek-r1-distill-qwen-14b@q8_0,
gemini-2.0-flash-thinking-exp-1219) can hardly follow the prompt instructions, lasting on average 2-14 moves.
- The distilled dense 14B and 32B versions of DeepSeek-R1 (deepseek-r1-distill-qwen-32b@q4_k_m, deepseek-r1-distill-qwen-14b@q8_0) did not impress: very poor
instruction following, not even close to the full 671B MoE model. Could quantization have an effect? Don't know.
- Distilled R1 models use <think></think> tags to enclose "reasoning" tokens (the ones hidden for o1); sometimes they fail to emit the opening <think> tag
(most often in the very first reply). See the sketch after these notes for one tolerant way to strip such tags before parsing a reply.
- The new Gemini Thinking model (gemini-2.0-flash-thinking-exp-01-21) did improve a lot in instruction following. While the previous December version
(gemini-2.0-flash-thinking-exp-1219) was almost useless, staying in the game for 2.3 moves on average (i.e., breaking the prompt instructions on the 2nd move it
was asked to make), the new model lasted 40 moves on average and made far fewer mistakes. In terms of chess, the newer gemini-2.0-flash-thinking-exp-01-21 showed
a negative material difference and a negligible draw count - way below other non-thinking models (even Google's own Gemma 2 9B).
- gemini-2.0-flash-thinking-exp-01-21 also demonstrated much less verbosity in its outputs, providing to-the-point answers. I didn't investigate deeply, yet it
seems that Gemini Thinking now properly isolates reasoning tokens. Per the docs: "On a standard request, the model responds with two parts, the thoughts and the
model response. You can check programmatically if a part is a thought or not by seeing if the part.thought field is set to True." Autogen doesn't seem to register
thought tokens in completion stats for Flash Thinking, while it does so for R1.
- Most of the failures/losses from gemini-2.0-flash-thinking-exp-01-21 were due to server errors. Occasionally, the server returned an empty message or flagged
the request with a "RECITATION" error, among others. Clearly, there is some instability in generation on the server side.
- Earlier non-o1 thinking models did not do well at splitting the reasoning/thinking part from the final answer; R1 and the newer Gemini Thinking do (yet that
didn't help the distilled R1 models).
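Since distilled R1 models wrap (or half-wrap) their reasoning in <think> tags, a tolerant client might strip that block before looking for an action. Below is a minimal sketch of such a cleanup step, under the assumption that everything inside the tags (or before a dangling closing tag) is reasoning; it is not the project's actual code:

```python
import re

def strip_think(reply: str) -> str:
    """Remove a <think>...</think> reasoning block from a distilled-R1 reply.
    Also handles the case where the opening <think> tag is missing but a
    closing </think> is present (everything before it is treated as reasoning)."""
    # Well-formed block: <think> ... </think>
    cleaned = re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL)
    # Missing opening tag: drop everything up to the first </think>
    if "</think>" in cleaned:
        cleaned = cleaned.split("</think>", 1)[1]
    return cleaned.strip()
```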
-------------------
January 21, 2025: More Models, More Tests
-------------------
- Added the new Claude 3.5 Sonnet (claude-v3-5-sonnet-v2) and also retested the previous version (claude-v3-5-sonnet-v1), giving them more runs (60 this time).
What surprised me was not just that v2 showed some wins, but that v1 demonstrated even more wins (although it also lost to the random player more often).
Back in November, when I did 30 runs of Sonnet 3.5, there were 0 wins. Did Anthropic change something in v1 as well, or was it just stochasticity?
- Conducted more runs of the GPT-4o models (gpt-4o-2024-05-13, gpt-4o-2024-08-06, gpt-4o-2024-11-20). The August version also achieved a win, yet it also had its
first non-zero mistake (the previous 30-run result had a perfect 0 mistakes) - this time because Azure's content filter halted one of the games.
- OpenAI's and Anthropic's models are still the most durable, i.e., demonstrating the fewest mistakes and staying in almost all games until the very end without
falling off due to a random glitch/hallucination or breaking the prompt instructions. Clearly, it is a challenge for most models to sustain a 100-turn
(200-move) game with an average of 3-4 dialog interactions per game turn.
- DeepSeek-R1 distilled into 32B Qwen (deepseek-r1-distill-qwen-32b@q4_k_m) didn't perform well - barely able to move past a couple of turns in a chess game,
no better than QwQ or Flash Thinking.
- o1 models are still special, being the only models maintaining a positive material difference (gaining more pieces than they lose), versus most other models
having a significant negative difference - which suggests the wins of non-o1 models can hardly be attributed to reasoning performance (i.e., to the models
meaningfully moving the pieces rather than faking a game).
-------------------
January 16, 2025: First wins by non-reasoning model
-------------------
The November 2024 update to GPT-4o demonstrated a few wins, something non-o1 models couldn't do before. Yet, what is curious is that the updated
model not only scored wins but also lost to the Random Player more often than others.
-------------------
January 14, 2025: Reasoning Models, First Wins
-------------------
In my initial notes (see below), I expressed skepticism about LLMs ever being capable of achieving meaningful wins against a Random Player
(let alone a dedicated chess engine). I also mentioned the ARC Challenge as a similar benchmark that is difficult or unsolvable by LLMs.
Yet, in late December 2024, OpenAI introduced their o3 model, which excelled in
ARC-AGI-1.
Around the same time, I received API access to o1-mini, and it was impressive! o1-mini scored 9 wins in 30 games (30%),
making it the first model
capable of meaningful chess games. Next, I tried o1-preview, achieving an even more impressive 46.7% win rate.
The o1 models did something different; they played a meaningful game rather than randomly moving pieces on the board as
seen with older "non-reasoning" models. They formed a league of their own—a "strong" model capable of winning.
I also tested a number of other reasoning models introduced after o1—Google's Gemini 2.0 Flash Thinking (via API),
QwQ 32B (4-bit quantized via LM Studio), and Sky T1 32B (also 4-bit quantized via LM Studio).
The results were underwhelming, to say the least—those 3 models couldn't complete even a single game, often failing
after a few moves (due to breaking the communication protocol or making illegal moves).
Just like the o1 models, they produced lots of tokens, yet unlike o1,
which seems to have 2 separate stages—thinking (those hidden tokens) and producing the
final answer—the other "thinking" models just stream lots of tokens without caring about
following any instructions as to what they are asked to produce.
Verbosity and failing to adhere to instructions seem to plague not just the o1 competitors
but also newer models (e.g., check Deepseek v3).
NOTES:
- When LLMs achieve a 90% win rate against a Random Player, I will replace it with Stockfish.
- You can find conversation logs for o1-mini here and for o1-preview here.
- Conversation logs are also available for Gemini 2.0 Thinking, QwQ, and Sky T1.
- While the o1 models used a lot of tokens (1221 tokens per move for o1-mini and 2660 for o1-preview, vs. below 100 tokens for an average model), their replies
were concise (most of the tokens were hidden ones) and had few mistakes.
- OpenAI models might have an advantage over other models because the prompts have been crafted with GPT-4 models.
This might explain why many models perform worse. Contrary to this point, there are non-OpenAI models (e.g., Anthropic)
that do not struggle with instruction following. I believe the factor of prompts and adjusting prompts for each specific model plays a minor role.
- An idea: while other (non-o1) reasoning models might be poor instruction followers, why not introduce a deliberate final-answer synthesis step? I.e., use a
cheaper LLM (e.g., 4o-mini) to take the long scroll from a reasoning model and prompt it to produce the final answer. Would the reasoning models then show
better chess abilities? (A rough sketch of this idea follows these notes.)
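Here is one way that synthesis step might look. This is only a sketch of the idea above; the `complete` callable is a hypothetical wrapper around whatever chat-completion API is in use and is not part of the project:

```python
def synthesize_action(reasoning_output: str, complete) -> str:
    """Second-stage pass: a cheaper model distills a long "thinking" transcript
    into a single protocol-conforming action. `complete(prompt) -> str` is a
    hypothetical wrapper around any chat-completion API."""
    prompt = (
        "Below is a chess assistant's long reasoning about its next action.\n"
        "Reply with exactly one action and nothing else: 'get_current_board',\n"
        "'get_legal_moves', or 'make_move <UCI move>'.\n\n"
        f"Reasoning:\n{reasoning_output}"
    )
    return complete(prompt).strip()
```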
-------------------
November 12, 2024: Initial Observations, Opinion
-------------------
Language Models can make moves in a chess game, but they cannot actually play chess or progress towards a victory.
The experiment started after taking a DeepLearning.AI class that taught the Microsoft Autogen framework. One of the lessons demonstrated
a simulation of a chess game between LLMs. I was immediately intrigued by the idea of putting different LLMs head-to-head in a chess competition.
However, I was surprised that the naive prompting strategy from the class never led to a game completion; extending the prompts didn't help.
I ended up testing LLMs' performance in a chess game using a Random Player. A human player with reasonable chess skills would have no problems winning
against a random player, yet LLMs failed miserably in this competition.
I suspect that crushing this "LLM Chess" eval might be as hard as the ARC Challenge—a benchmark created to demonstrate
the true nature of text-generating LLMs, exploit their weaknesses, and show how LLMs struggle with reasoning. "It's easy for humans, but hard for AI."
LLMs and Transformer-based models can be trained specifically to play chess. There are projects on the internet where people have fine-tuned LLMs
as chess players. Yet, those are specialized models that can't be good chat models.
I am looking forward to testing new releases of SOTA and frontier models. It would be great to see a model that starts scoring wins against
the chaos monkey while maintaining performance at traditional chat tasks.
NOTES:
- More data on game simulations is available here and here.
- No history of moves is available to the LLM; no reflection is used (no giving the model "time to think").
- Experiments with reflection suggest that LLMs do even worse when prompted to evaluate options before making a move (reflection results).
- Could LLMs improve their performance if given the whole log of the game?
- The chess engine (Stockfish) has a 100% win rate, with an average game taking 32 moves to complete.
- Stockfish 17, macOS, 0.1ms time limit (vs. the 0.1s default) - even with this decreased performance, Stockfish dominates the Random Player.
- The Random Player (as white) wins over the Random Player (as black) in 10.5% of cases, while LLMs scored 0 wins.
- Indeed, giving the right of the first move gives an advantage to the white player.
- LLMs do worse than random players.
- It's as if the LLM had no goal to win, as if its intention were to keep the game going. What if I prompted it that 200 moves is the maximum and the game ends after that - would it try harder? Could adding an explicit instruction to win to the system prompt help?
- While some models are less verbose and follow the rules strictly (e.g., OpenAI), others are verbose.
- Initially, I used exact match when communicating with an LLM and prompted it to reply with action names (show board, get legal moves, make move) - this worked well with OpenAI.
- After the list of models was extended, the original prompts had issues steering them.
- As a result, I changed the communication protocol to use regex and be tolerant of the reply format, doing its best to extract the action and arguments from LLM replies.
- Since the models don't score any wins, there must be some alternative metric demonstrating game progress.
- For the time being, Draws are used: the more draws, the better.
- Yet most of the draws scored are due to hitting the 200 max-move limit, and hence the metric mostly demonstrates adherence to the communication protocol/prompt conventions.
- Logs also contain "Material Count" metrics - the weighted score of the pieces a player has; at the beginning a player has a total of 39 units of material (a sketch of the material computation follows these notes).
- Material difference could be a good metric to evaluate progress: the player with more material left as the game progresses is in a better position.
- Yet most of the models demonstrated a negative material difference, and one model (gpt-35-turbo-0125) failed to make a single move, leaving its material difference at 0 and putting it above models that had a negative material difference while staying in the game much longer.
- It might be reasonable to create a computed metric that accounts for both material and game length, addressing both the endless-pointless-game concern and the never-changing material difference caused by failing too early.
- What if the model were not given the list of legal moves? Would the models have to figure out legal moves on their own and struggle to progress? Could giving the models a list of legal moves essentially break reasoning, turning the game into simple instruction following (i.e., pick one item from a list rather than win the game)?
- The older GPT-4 Turbo did better than the newer GPT-4o version; this is yet another eval demonstrating how newer models can perform worse, supporting the claim that the 4o family of models are smaller, cheaper-to-run models.
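The material numbers above could be reproduced with a small helper like the following; a minimal sketch using python-chess and the conventional piece values (not the project's actual implementation):

```python
import chess

PIECE_VALUES = {
    chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
    chess.ROOK: 5, chess.QUEEN: 9,  # king excluded; 8+6+6+10+9 = 39 at the start
}

def material(board: chess.Board, color: chess.Color) -> int:
    """Sum of weighted piece values for one side."""
    return sum(value * len(board.pieces(piece_type, color))
               for piece_type, value in PIECE_VALUES.items())

def material_diff(board: chess.Board) -> int:
    """Black (LLM) minus white (Random Player); positive means the LLM is ahead."""
    return material(board, chess.BLACK) - material(board, chess.WHITE)
```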