Can a Large Language Model play chess? You can prompt it to make
a move based on the board state, hint it with a list of legal moves (and ask it to pick one). You will find that
an LLM can move pieces, even provide sound comments on why it made a certain move
and what tactic or strategy it followed.
But can LLMs actually make any meaningful moves and win in a chess game? Why don't
we put them up against a random player (i.e., a bot that randomly picks any move from a list of
legal moves for the current position)? After all, the models are called Foundation Models for
a reason; they have the knowledge of the entire Internet, can (supposedly) reason, and pass
numerous math evaluations and PhD-level exams. What could be easier for an LLM than to score a victory over
a chaos monkey?
[Sample game: GPT-4o Mini (Black) vs. Random Player (White) - Outcome: Draw, max moves of 200 reached, Material White: 16, Material Black: 18]
[Leaderboard: Player / Draws / Wins / Mistakes / Tokens]
NOTES:
- LLMs played as black against a Random Player (as white).
- 30 game simulations for Random Player vs. LLM.
- Bottom rows in green demonstrate simulations of how a Chess Engine (Stockfish v17) fares against a Random Player.
- 1000 simulations for Random Player vs. Chess Engine and Random vs. Random.
- You read that right: LLMs scored 0 wins.
- Using Draws instead of Wins to evaluate LLMs' chess proficiency.
- The Mistakes metric evaluates LLMs' instruction-following capabilities and resistance to hallucinations (making up illegal moves while having a list of legal moves provided in the prompt).
METRICS:
- Player: Model playing as black against a Random Player.
- Draws: Percentage of games without a winner (e.g., reaching the maximum number of 200 moves, or a stalemate). Shows LLMs' proficiency in chess.
- Wins: How often the player scored a win (due to checkmate or the opponent failing to make a move).
- Mistakes: Number of erroneous LLM replies per 1000 moves - how often an LLM failed to follow the instructions and make a move, e.g., hallucinating illegal moves or not conforming to the communication protocol. Shows the model's instruction-following capabilities and hallucination resistance.
- Tokens: Number of tokens generated per move. Demonstrates the model's verbosity.
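The Mistakes normalization is simple arithmetic; a minimal sketch (variable names are illustrative, not the project's):

```python
def mistakes_per_1000_moves(total_mistakes: int, total_moves: int) -> float:
    # Normalize erroneous replies by game length so models that survive
    # long games are comparable to models that fail after a few moves.
    return 1000 * total_mistakes / total_moves
```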
Libraries and Dependencies Used:
- chess: A Python library for handling chess game rules and basic operations, including board representation, legal-move generation, and game-state evaluation. It is not a chess engine and does not calculate best moves (see the sketch after this list).
- Microsoft Autogen: Used as the backbone for LLM communication. It also implements the interaction between the chess board and custom agents such as GameAgent, RandomPlayerAgent, and AutoReplyAgent for simulating different player types.
- Stockfish: The chess engine doing the actual best-move calculation, used as a reference to demonstrate what a real chess player's performance looks like.
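For orientation, here is a minimal sketch of how these pieces fit together: python-chess tracks the rules while Stockfish does the searching. The "stockfish" binary path is an assumption and depends on your installation:

```python
import chess
import chess.engine

# python-chess handles the rules: board state, legal-move generation,
# and game-over detection. It performs no move search of its own.
board = chess.Board()
legal_uci = [move.uci() for move in board.legal_moves]  # e.g. "e2e4"

# Stockfish is a real engine, used here only as a reference player.
engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # path is an assumption
result = engine.play(board, chess.engine.Limit(time=0.1))  # 0.1s per move
board.push(result.move)
engine.quit()
```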
Workflow:
1. The game is initialized with a chess board and two players: a Random Player (as white) and an LLM (as black).
2. The game loop runs until a termination condition is met, such as checkmate, stalemate, or reaching the maximum number of moves.
3. Each player takes turns making a move. The Random Player selects a move randomly from the list of legal moves.
4. The LLM is prompted to make a move using a structured dialog, which includes actions like getting the current board state, retrieving legal moves, and making a move.
5. The game state is updated after each move, and the board is visualized if enabled.
6. Game statistics are generated and stored at the end of the game.
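A stripped-down version of this loop, with the Autogen agents replaced by plain functions, might look like the following; llm_pick_move is a hypothetical stand-in for the prompted LLM dialog:

```python
import random
import chess

MAX_MOVES = 200

def random_player(board: chess.Board) -> chess.Move:
    # The Random Player: pick uniformly from the legal moves.
    return random.choice(list(board.legal_moves))

def llm_pick_move(board: chess.Board) -> chess.Move:
    # Hypothetical stand-in for the LLM dialog; it just plays randomly
    # here so the sketch stays self-contained and runnable.
    return random.choice(list(board.legal_moves))

board = chess.Board()
moves_made = 0
while not board.is_game_over() and moves_made < MAX_MOVES:
    pick = random_player if board.turn == chess.WHITE else llm_pick_move
    board.push(pick(board))
    moves_made += 1

print(board.result(claim_draw=True))  # "1-0", "0-1", "1/2-1/2", or "*" at the move cap
```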
Dialog:
Here is an example of a dialog that prompts a model to make a move:
Proxy (to Player_Black):
You are a professional chess player and you play as black. It is now your turn to make a move. Before making a move you can pick one of the following actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'make_move <UCI move>' when you are ready to complete your turn (e.g., 'make_move e2e4')
Respond with the action.
Player_Black (to Proxy):
get_current_board
Proxy (to Player_Black):
♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖
Player_Black (to Proxy):
get_legal_moves
Proxy (to Player_Black):
a7a6, b7b6, c7c6, d7d6, e7e6, f7f6, g7g6, h7h6, a7a5, b7b5, c7c5, d7d5, e7e5, f7f5, g7g5, h7h5, b8a6, b8c6, g8f6, g8h6
Player_Black (to Proxy):
make_move e7e5
Proxy (to Player_Black):
Move made, switching player
One move - one dialog. The dialog is limited to 10 turns (10 pairs of Proxy/Agent request/response messages). A maximum of 3 mistakes (not conforming to
the communication notation, picking a wrong action, or picking a wrong move) is allowed per dialog.
The game is terminated and the LLM is given a loss if either the maximum number of turns or the maximum number of mistakes is reached within a given dialog.
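As a rough illustration of what "conforming to the communication notation" means in practice (and of the regex-based tolerance described in the notes further down), action extraction from a free-form reply could look like this; the exact patterns in the project may differ:

```python
import re

# Tolerant matcher: find a known action anywhere in a verbose reply.
ACTION_RE = re.compile(
    r"(get_current_board|get_legal_moves|make_move)\s*([a-h][1-8][a-h][1-8][qrbn]?)?",
    re.IGNORECASE,
)

def extract_action(reply: str):
    """Return (action, move) from an LLM reply, or (None, None).

    A (None, None) result counts as one mistake in the dialog;
    three mistakes (or ten turns) end the dialog with a loss.
    """
    match = ACTION_RE.search(reply)
    if match is None:
        return None, None
    action = match.group(1).lower()
    move = match.group(2).lower() if match.group(2) else None
    if action == "make_move" and move is None:
        return None, None  # make_move without a legal-looking argument
    return action, move
```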
Language Models can make moves in a chess game, but they cannot actually play chess or progress towards a victory.
The experiment started after taking a class at DeepLearning.AI, which taught the Microsoft Autogen framework. One of the lessons demonstrated
a simulation of a chess game between LLMs. I was immediately intrigued by the idea of putting different LLMs head-to-head in a chess game competition.
However, I was surprised that the naive prompting strategy from the class never led to a game completion. Extending prompts didn't help.
I ended up testing LLMs' performance in a chess game using a Random Player. A human player with reasonable chess skills would have no problems winning
against a random player, yet LLMs failed miserably in this competition.
I suspect that crushing this "LLM Chess" eval might be as hard as the ARC Challenge—a benchmark created to demonstrate
the true nature of text-generating LLMs, exploit their weaknesses, and show how LLMs struggle with reasoning. "It's easy for humans, but hard for AI."
LLMs and Transformer-based models can be trained specifically to play chess. There are projects on the internet where people have fine-tuned LLMs
as chess players. Yet, those are specialized models that can't be good chat models.
I am looking forward to testing new releases of SOTA and frontier models. It would be great to see a model that starts scoring wins against
the chaos monkey while maintaining performance at traditional chat tasks.
NOTES:
- More data on game simulations is available here and here.
- No history of moves is available to the LLM, and no reflection is used (giving the model "time to think").
- Experiments with reflection suggest that LLMs do even worse when they are prompted to evaluate options before making a move
(reflection results).
- Could LLMs improve their performance if given the whole log of the game?
- The chess engine (Stockfish) has a 100% win rate, with an average game taking 32 moves to complete.
- Stockfish 17, macOS, 0.1ms time limit (vs. the 0.1s default) - even with this decreased performance, Stockfish dominates the Random Player.
- Random Player (as white) wins over Random Player (as black) in 10.5% of cases - LLMs scored 0 wins.
- Indeed, giving the right of the first move gives an advantage to the white player.
- LLMs do worse than random players.
- It's as if the LLM had no goal to win, as if its intention were to keep the game going. What if I prompted it that 200 moves is the maximum and the game ends after that? Would it try harder? Can adding an explicit instruction to win to the system prompt help?
- Some models are terse and follow the rules strictly (e.g., OpenAI's), while others are verbose.
- Initially, I used exact matching when communicating with an LLM and prompted it to reply with action names (show board, get legal moves, make move); this worked well with OpenAI models.
- After the list of models was extended, the original prompts had issues steering them.
- As a result, I changed the communication protocol to use regex and be tolerant of the reply format, doing its best to extract the action and arguments from LLM replies.
- Since models don't score any wins, there must be some alternative metric demonstrating game progress.
- For the time being, Draws is used: the more draws, the better.
- Yet most of the draws scored are due to hitting the 200 max-moves limit, and hence the metric mostly demonstrates adherence to the communication protocol and prompt conventions.
- Logs also contain "Material Count" metrics - the weighted scores of the pieces a player has on the board; at the beginning of a game each player has a total of 39 units of material (see the sketch at the end of these notes).
- Material difference could be a good metric to evaluate progress: the player with more material left as the game progresses is in the better position.
- Yet most of the models demonstrated a negative material difference, and one model (gpt-35-turbo-0125) failed to make a single move, leaving its material difference at 0 and ranking it above models that had negative material while staying in the game much longer.
- It might be reasonable to create a computed metric that accounts for both material and the length of the game, addressing the endless-pointless-game concern as well as the material difference that never changes because a model fails too early.
- What if the model is not given the list of legal moves? Would it have to figure out legal moves on its own and struggle to progress? Could giving the models a list of legal moves essentially break reasoning, turning the game into simple instruction following (i.e., picking one item from a list rather than trying to win the game)?
- The older GPT-4 Turbo did better than the newer GPT-4o version; this is yet another eval demonstrating how newer models performed worse, supporting the claim that the 4o family consists of smaller, cheaper-to-run models.
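For reference, the sketch below computes a material count under the standard piece values (pawn 1, knight 3, bishop 3, rook 5, queen 9; the king is not counted), which gives the 39 starting units mentioned above. This illustrates the idea and is not necessarily the project's exact implementation:

```python
import chess

# Standard material values; each side starts with 8 + 6 + 6 + 10 + 9 = 39.
PIECE_VALUES = {
    chess.PAWN: 1,
    chess.KNIGHT: 3,
    chess.BISHOP: 3,
    chess.ROOK: 5,
    chess.QUEEN: 9,
}

def material(board: chess.Board, color: chess.Color) -> int:
    # Sum the values of all pieces of the given color still on the board.
    return sum(
        value * len(board.pieces(piece_type, color))
        for piece_type, value in PIECE_VALUES.items()
    )

board = chess.Board()
print(material(board, chess.WHITE) - material(board, chess.BLACK))  # 0 at the start
```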