[Sample game widget: GPT-4o Mini (Black) vs. Random Player. Outcome: Draw (max moves reached: 200). Material: White 16, Black 18.]
Can a Large Language Model play chess? You can prompt it to make
a move based on the board state and hint it with a list of legal moves (asking it to pick one). You will find that
an LLM can move pieces (though not all models can; some struggle with instruction following) and even provide sound comments on why it made a certain move
and what tactic or strategy it followed.
But can LLMs actually make any meaningful moves and win a chess game? Why don't
we put them up against a random player (i.e., a bot that randomly picks any move from the list of
legal moves for the current position)? After all, the models are called Foundation Models for
a reason; they have the knowledge of the entire Internet, can (supposedly) reason, and pass
numerous math evaluations and PhD-level exams. What could be easier for an LLM than scoring a victory over
a chaos monkey?
Let's find out ツ
[Leaderboard table: Player | Wins | Draws | Mistakes | Tokens]
METRICS:
- Player: Model playing as black against a Random Player.
- Wins: How often the player scored a win (due to checkmate or the opponent failing to make a move). Shows an LLM's proficiency in chess.
- Draws: Percentage of games without a winner (e.g., reaching the maximum of 200 moves or stalemate). Shows a weak LLM's chess proficiency when it can't win.
- Mistakes: Number of erroneous LLM replies per 1000 moves, i.e., how often an LLM failed to follow the instructions and make a move (e.g., due to hallucinations, picking illegal moves, or not conforming to the communication protocol). Shows the model's instruction-following capabilities and hallucination resistance.
- Tokens: Number of tokens generated per move. Demonstrates the model's verbosity.
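For clarity, here is a minimal sketch of how these columns could be computed from per-game stats. The field names are illustrative, not the project's actual schema:

```python
def summarize(games):
    """Aggregate leaderboard columns from a list of per-game stats.
    Each game dict is assumed (illustratively) to carry: winner ('llm', 'random'
    or None for a draw), llm_moves, llm_mistakes, llm_tokens."""
    total_games = len(games)
    total_moves = sum(g["llm_moves"] for g in games)
    return {
        "wins": 100 * sum(g["winner"] == "llm" for g in games) / total_games,    # % of games won
        "draws": 100 * sum(g["winner"] is None for g in games) / total_games,    # % of games drawn
        "mistakes": 1000 * sum(g["llm_mistakes"] for g in games) / total_moves,  # per 1000 moves
        "tokens": sum(g["llm_tokens"] for g in games) / total_moves,             # per move
    }
```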
NOTES:
- LLMs played as black against a Random Player (as white).
- 30+ game simulations for Random Player vs. LLM.
- Bottom rows in green demonstrate how a Chess Engine (Stockfish v17) fares against a Random Player.
- 1000 simulations for Random Player vs. Chess Engine and Random vs. Random.
- You read that right: LLMs originally scored 0 wins. That is no longer the case, with o1-mini being the first LLM to score wins.
- Using Draws instead of Wins to evaluate LLMs' chess proficiency.
- The default sorting is by Wins DESC, Draws DESC, and Mistakes ASC (see the sketch after these notes).
- Strong models (the ones that win) are judged on chess proficiency by % Wins; weak ones by % Draws.
- Mistakes metric gives an evaluation of LLMs' instruction-following capabilities and resistance to hallucinations (making up non-legal moves while having a list of legal moves provided in the prompt).
- Sort by the Mistakes column to get a ranking by instruction-following ability (models with the fewest mistakes being better).
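The default ordering mentioned above, sketched as a one-liner (rows are dicts like those produced by the summarize() sketch earlier; names are illustrative):

```python
def sort_leaderboard(rows):
    # Wins DESC, then Draws DESC, then Mistakes ASC
    return sorted(rows, key=lambda r: (-r["wins"], -r["draws"], r["mistakes"]))
```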
Libraries and Dependencies Used:
- chess: A Python library for handling chess game rules and basic operations, including board representation, legal move generation, and game state evaluation. This is not a chess engine running the actual calculation of the best move.
- Microsoft Autogen is used as a backbone for LLM communication. It also implements the interaction between a Chess Board and custom agents like GameAgent, RandomPlayerAgent, AutoReplyAgent, and others for simulating different player types.
- Stockfish: the chess engine doing the actual best-move calculation, used as a reference to demonstrate what a real chess player's performance looks like.
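A minimal sketch of how the chess library and the engine fit together; the Stockfish binary path is an assumption and will differ per system:

```python
import chess
import chess.engine

board = chess.Board()  # python-chess tracks the board, rules, and legal moves

# Stockfish does the actual best-move calculation (the path is an assumption;
# point it at wherever the engine binary lives on your machine).
with chess.engine.SimpleEngine.popen_uci("/usr/local/bin/stockfish") as engine:
    result = engine.play(board, chess.engine.Limit(time=0.1))  # 0.1 s per move
    board.push(result.move)

print(board)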
Workflow:
1. The game is initialized with a chess board and two players: a Random Player (as white) and an LLM (as black).
2. The game loop runs until a termination condition is met, such as checkmate, stalemate, or reaching the maximum number of moves.
3. Each player takes turns making a move. The Random Player selects a move randomly from the list of legal moves.
4. The LLM is prompted to make a move using a structured dialog, which includes actions like getting the current board state, retrieving legal moves, and making a move.
5. The game state is updated after each move, and the board is visualized if enabled.
6. Game statistics are generated and stored at the end of the game.
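A minimal sketch of this loop using the python-chess library. The LLM side is reduced to a placeholder so the example stays runnable, and the names are illustrative rather than the project's actual classes:

```python
import random
import chess

MAX_MOVES = 200  # hitting the cap counts as a draw in this eval

def random_player(board: chess.Board) -> chess.Move:
    # Random Player: pick any move uniformly at random from the legal moves
    return random.choice(list(board.legal_moves))

def llm_player(board: chess.Board) -> chess.Move:
    # Placeholder for the LLM dialog (get_current_board / get_legal_moves / make_move);
    # reusing the random picker keeps the sketch self-contained.
    return random_player(board)

def play_game() -> str:
    board = chess.Board()
    for _ in range(MAX_MOVES):
        if board.is_game_over():
            break
        move = random_player(board) if board.turn == chess.WHITE else llm_player(board)
        board.push(move)
    return board.result(claim_draw=True)  # "1-0", "0-1", "1/2-1/2", or "*" if the cap was hit

print(play_game())
```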
Dialog:
Here is an example of a dialog that prompts a model to make a move:
Proxy (to Player_Black):
You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of the following actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'make_move <UCI formatted move>' when you are ready to complete your turn (e.g., 'make_move e2e4')
Respond with the action.
Player_Black (to Proxy):
get_current_board
Proxy (to Player_Black):
♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖
Player_Black (to Proxy):
get_legal_moves
Proxy (to Player_Black):
a7a6, b7b6, c7c6, d7d6, e7e6, f7f6, g7g6, h7h6, a7a5, b7b5, c7c5, d7d5, e7e5, f7f5, g7g5, h7h5
Player_Black (to Proxy):
make_move e7e5
Proxy (to Player_Black):
Move made, switching player
1 move = 1 dialog. The dialog is limited to 10 turns (10 pairs of Proxy/Agent request/response messages). A maximum of 3 mistakes (not conforming to
the communication notation, picking a wrong action, or picking a wrong move) are allowed per dialog.
The game is terminated and the LLM is given a loss if the maximum number of dialog turns or LLM mistakes is reached.
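A hedged sketch of how such a dialog could be driven programmatically. The `ask_llm` callable and the exact parsing are illustrative, not the project's actual Autogen agents:

```python
import re
from typing import Callable, Optional

import chess

MAX_DIALOG_TURNS = 10
MAX_MISTAKES = 3

def run_move_dialog(board: chess.Board, ask_llm: Callable[[str], str]) -> Optional[chess.Move]:
    """One move = one dialog. Returns the chosen move, or None if the LLM
    exhausted its turn or mistake budget (which counts as a loss)."""
    mistakes = 0
    prompt = ("You play as black. Pick one action: 'get_current_board', "
              "'get_legal_moves', or 'make_move <UCI move>'.")
    for _ in range(MAX_DIALOG_TURNS):
        reply = ask_llm(prompt)
        if "get_current_board" in reply:
            prompt = str(board)  # text rendering of the current position
        elif "get_legal_moves" in reply:
            prompt = ", ".join(m.uci() for m in board.legal_moves)
        else:
            match = re.search(r"make_move\s+([a-h][1-8][a-h][1-8][qrbn]?)", reply)
            if match and chess.Move.from_uci(match.group(1)) in board.legal_moves:
                return chess.Move.from_uci(match.group(1))
            mistakes += 1  # wrong action, unparsable reply, or illegal move
            prompt = "Invalid reply, respond with one of the listed actions."
        if mistakes >= MAX_MISTAKES:
            return None
    return None
```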
-------------------
January 16, 2025 : First wins by non-reasoning model
-------------------
The November 2024 update to GPT-4o has demonstrated a few wins, something non-o1 models couldn't do before. What is curious, though, is that the updated
model not only scores wins, it also loses to the Random Player more often than other models.
-------------------
January 14, 2025 : Reasoning Models, First Wins
-------------------
In my initial notes (see below), I expressed skepticism about LLMs ever being capable of achieving meaningful wins against a Random Player
(let alone a dedicated chess engine). I also mentioned the ARC Challenge as a similar benchmark that is difficult or unsolvable by LLMs.
Yet, in late December 2024, OpenAI introduced their o3 model, which excelled in
ARC-AGI-1.
Around the same time, I received API access to o1-mini, and it was impressive! o1-mini scored 9 wins in 30 games (30%),
making it the first model
capable of meaningful chess games. Next, I tried o1-preview, achieving an even more impressive 46.7% win rate.
The o1 models did something different; they played a meaningful game rather than randomly moving pieces on the board as
seen with older "non-reasoning" models. They formed a league of their own—a "strong" model capable of winning.
I also tested a number of other reasoning models introduced after o1—Google's Gemini 2.0 Flash Thinking (via API),
QwQ 32B (4-bit quantized via LM Studio), and Sky T1 32B (also 4-bit quantized via LM Studio).
The results were underwhelming, to say the least—those 3 models couldn't complete even a single game, often failing
after a few moves (due to breaking the communication protocol or making illegal moves).
Just like the o1 models, they produced lots of tokens, yet unlike o1,
which seems to have 2 separate stages—thinking (those hidden tokens) and producing the
final answer—the other "thinking" models just stream lots of tokens without caring about
following any instructions as to what they are asked to produce.
Verbosity and failing to adhere to instructions seem to plague not just the o1 competitors
but also newer models (e.g., check Deepseek v3).
NOTES:
- When LLMs achieve a 90% win rate against a Random Player, I will replace it with Stockfish.
- You can find conversation logs for o1-mini here and for o1-preview here.
- Conversation logs for Gemini 2.0 Thinking, QwQ, and Sky T1 are also available.
- While o1 models used a lot of tokens (1221 tokens per move for o1-mini and 2660
for o1-preview vs. below 100 tokens for an average model), their replies were concise (most of the tokens
were hidden ones) and had few mistakes.
- OpenAI models might have an advantage over other models because the prompts have been crafted with GPT-4 models.
This might explain why many models perform worse. Contrary to this point, there are non-OpenAI models (e.g., Anthropic)
that do not struggle with instruction following. I believe the factor of prompts and adjusting prompts for each specific model plays a minor role.
- An idea: while other (non-o1) reasoning models might be poor instruction followers, why
not introduce a deliberate step of final answer synthesizing? I.e., use a cheaper LLM (e.g., 4o-mini)
to get the long scroll from a reasoning model and prompt it to produce a final answer.
Will the reasoning models show better chess abilities?
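A hypothetical sketch of that idea using the OpenAI Python SDK. The model name and the exact instructions are assumptions, purely to illustrate the two-stage setup:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def synthesize_action(reasoning_dump: str) -> str:
    """Feed the long, free-form output of a reasoning model to a cheaper model
    and ask it to distill a single protocol-conforming action."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # the cheap "synthesizer"; an assumption, any instruct model would do
        messages=[
            {"role": "system",
             "content": "Extract the final chess action from the text below. "
                        "Reply with exactly one line, e.g. 'make_move e7e5' or 'get_legal_moves'."},
            {"role": "user", "content": reasoning_dump},
        ],
    )
    return response.choices[0].message.content.strip()
```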
-------------------
November 12, 2024 : Initial Observations, Opinion
-------------------
Language Models can make moves in a chess game, but they cannot actually play chess or progress towards a victory.
The experiment started after taking a class at DeepLearning, which taught the Microsoft Autogen framework. One of the classes demonstrated
a simulation of a chess game between LLMs. I was immediately intrigued by the idea of putting different LLMs head-to-head in a chess game competition.
However, I was surprised that the naive prompting strategy from the class never led to a game completion. Extending prompts didn't help.
I ended up testing LLMs' performance in a chess game using a Random Player. A human player with reasonable chess skills would have no problems winning
against a random player, yet LLMs failed miserably in this competition.
I suspect that crushing this "LLM Chess" eval might be as hard as the ARC Challenge—a benchmark created to demonstrate
the true nature of text-generating LLMs, exploit their weaknesses, and show how LLMs struggle with reasoning. "It's easy for humans, but hard for AI."
LLMs and Transformer-based models can be trained specifically to play chess. There are projects on the internet where people have fine-tuned LLMs
as chess players. Yet, those are specialized models that can't be good chat models.
I am looking forward to testing new releases of SOTA and frontier models. It would be great to see a model that starts scoring wins against
the chaos monkey while maintaining performance at traditional chat tasks.
NOTES:
- More data on game simulations is available here and
here.
- No history of moves is available to the LLM, and no reflection is used (giving the model "time to think").
- Experiments with reflection suggest that LLMs do even worse when they are prompted to evaluate options before making a move
(reflection results).
- Could LLMs improve their performance if given the whole log of the game?
- The chess engine (Stockfish) has a 100% win rate, with an average game taking 32 moves to complete.
- Stockfish 17, macOS, 0.1 ms time limit (vs. the 0.1 s default): even with decreased performance, Stockfish dominates the Random Player.
- Random Player (as white) beats Random Player (as black) in 10.5% of cases, while LLMs scored 0 wins.
- Indeed, giving the right of the first move gives an advantage to the white player.
- LLMs do worse than random players.
- It's as if the LLM had no goal to win, as if its intention was to keep the game going. What if I prompted it that 200 moves is the maximum and the game ends after that? Would it try harder? Could adding an explicit instruction to win to the system prompt help?
- While some models are less verbose and follow the rules strictly (e.g., OpenAI), others are verbose.
- Initially, I used exact matching when communicating with an LLM and prompted it to reply with action names (show board, get legal moves, make move); this worked well with OpenAI.
- After the list of models was extended, the original prompts had issues steering them.
- As a result, I changed the communication protocol to use regex and be tolerant of the reply format, doing its best to extract the action and arguments from LLM replies.
- Since models didn't score any Wins, there must be some alternative metric demonstrating game progress.
- For the time being, Draws are used: the more draws, the better.
- Yet most of the draws scored are due to hitting the 200-move limit, and hence the metric mostly demonstrates adherence to the communication protocol/prompt conventions.
- Logs also contain "Material Count" metrics: the weighted score of the pieces a player has; at the beginning a player has a total of 39 units of material (see the sketch after these notes).
- Material difference could be a good metric to evaluate progress: the player having more material left as the game progresses is in a better position.
- Yet most of the models demonstrated a negative material difference, and one of the models (gpt-35-turbo-0125) failed to make a single move, leaving its material difference at 0 and putting it above models that had negative material while staying in the game much longer.
- It might be reasonable to create a computed metric that accounts for both material and game length, addressing the endless-pointless-game concern as well as a material difference that never changes because the model fails too early.
- What if the model is not given the list of legal moves? Will it have to figure out legal moves on its own and struggle to progress? Does giving the models a list of legal moves essentially break reasoning, turning the game into simple instruction following (i.e., pick one item from the list rather than win the game)?
- The older GPT-4 Turbo did better than the newer GPT-4o version; this is yet another eval demonstrating how newer models performed worse, supporting the claim that the 4o family of models are smaller and cheaper-to-run models.
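A small sketch of the material count, assuming the standard piece values (pawn 1, knight/bishop 3, rook 5, queen 9; the king isn't counted), which add up to the 39 units mentioned above:

```python
import chess

# Standard piece values (an assumption; the project may weigh pieces differently)
PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9}

def material(board: chess.Board, color: chess.Color) -> int:
    # Sum of weighted piece counts for one side
    return sum(value * len(board.pieces(piece_type, color))
               for piece_type, value in PIECE_VALUES.items())

board = chess.Board()
print(material(board, chess.WHITE), material(board, chess.BLACK))  # 39 39 at the start
```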