__       __                    ____     __                               
/\ \     /\ \       /'\_/`\    /\  _``. /\ \                              
\ \ \    \ \ \     /\      \   \ \ \/\_\\ \ \___      __    ____    ____  
 \ \ \  __\ \ \  __\ \ \__\ \   \ \ \/_/_\ \  _ `\  /'__`\ /',__\  /',__\ 
  \ \ \L\ \\ \ \L\ \\ \ \_/\ \   \ \ \L\ \\ \ \ \ \/\  __//\__, `\/\__, `\
   \ \____/ \ \____/ \ \_\\ \_\   \ \____/ \ \_\ \_\ \____\/\____/\/\____/
    \/___/   \/___/   \/_/ \/_/    \/___/   \/_/\/_/\/____/\/___/  \/___/ 

        

Can a Large Language Model play chess? You can prompt it to make a move based on the board state and hint it with a list of legal moves (asking it to pick one). You will find that an LLM can move pieces and even provide sound commentary on why it made a certain move and what tactic or strategy it followed.
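For illustration, here is a minimal sketch of that prompting step using the python-chess and openai packages. The prompt wording, the reply parsing, and the model call are assumptions made for this example, not necessarily the project's actual communication protocol.

```python
# Hedged sketch: ask an LLM for a move given the board state and legal moves.
# Assumes the python-chess and openai packages; prompt wording is illustrative.
import chess
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask_llm_for_move(board: chess.Board) -> str:
    legal_moves = [move.uci() for move in board.legal_moves]
    prompt = (
        "You are playing chess as Black.\n"
        f"Current position (FEN): {board.fen()}\n"
        f"Legal moves (UCI): {', '.join(legal_moves)}\n"
        "Reply with exactly one move from the list, in UCI notation."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```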

But can LLMs actually make meaningful moves and win a chess game? Why don't we put them up against a random player, i.e., a bot that randomly picks a move from the list of legal moves for the current position? After all, these models are called Foundation Models for a reason: they have the knowledge of the entire Internet, can (supposedly) reason, and pass numerous math evaluations and PhD-level exams. What could be easier for an LLM than scoring a victory over a chaos monkey?
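Such a random player is trivial to sketch. The snippet below pairs it with a bare-bones game loop that stops on checkmate, stalemate, or a hard move cap; `ask_llm_for_move` is the sketch above, and the retry and bookkeeping logic of the real harness is omitted.

```python
# Hedged sketch: a "chaos monkey" opponent and a bare-bones game loop.
# The real harness additionally logs mistakes, retries, token counts, etc.
import random
import chess

def random_move(board: chess.Board) -> chess.Move:
    # Pick uniformly from the legal moves of the current position.
    return random.choice(list(board.legal_moves))

def play_one_game(max_plies: int = 400) -> str:
    board = chess.Board()  # Random Player is White, the LLM is Black
    for _ in range(max_plies):  # hard cap keeps the sketch finite (~200 full moves)
        if board.is_game_over():
            break
        if board.turn == chess.WHITE:
            board.push(random_move(board))
        else:
            reply = ask_llm_for_move(board)  # from the sketch above
            try:
                move = chess.Move.from_uci(reply)
            except ValueError:
                move = None
            if move is None or move not in board.legal_moves:
                continue  # the real harness counts this as a mistake and retries
            board.push(move)
    outcome = board.outcome()
    return outcome.result() if outcome else "1/2-1/2 (move limit reached)"
```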

Let's find out ツ

Example game: Random Player (White) vs. GPT-4o Mini (Black)
- Outcome: Draw (maximum of 200 moves reached)
- Material White: 16
- Material Black: 18
LEADERBOARD:
Player | Draws | Wins | Mistakes | Tokens

NOTES:
- LLMs played as black against a Random Player (as white).
- 30 game simulations for Random Player vs. LLM.
- The bottom rows of the leaderboard (in green) show how a Chess Engine (Stockfish v17) fares against a Random Player.
- 1000 simulations for Random Player vs. Chess Engine and Random vs. Random.
- You read that right: the LLMs scored 0 wins.
- Draws, rather than Wins, are used to evaluate the LLMs' chess proficiency.
- The Mistakes metric evaluates LLMs' instruction-following capabilities and resistance to hallucinations (making up illegal moves even though a list of legal moves is provided in the prompt).

METRICS:
- Player: Model playing as black against a Random Player.
- Draws: Percentage of games without a winner (e.g., reaching the maximum of 200 moves or a stalemate). Reflects the LLM's proficiency in chess.
- Wins: How often the player scored a win (due to checkmate or the opponent failing to make a move).
- Mistakes: Number of erroneous LLM replies per 1000 moves, i.e., how often an LLM failed to follow the instructions and make a move, e.g., by hallucinating illegal moves or not conforming to the communication protocol. Shows the model's instruction-following capabilities and hallucination resistance.
- Tokens: Number of tokens generated per move. Reflects the model's verbosity (a sketch of how these columns could be computed follows this list).
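As an illustration of how these aggregates could be derived, the sketch below computes the leaderboard columns from per-game records; the `GameRecord` fields are hypothetical and not the project's actual log schema.

```python
# Hedged sketch: turning per-game records into leaderboard columns.
# The GameRecord fields are hypothetical, not the project's actual log schema.
from dataclasses import dataclass

@dataclass
class GameRecord:
    result: str    # "win", "draw", or "loss" from the LLM's point of view
    moves: int     # moves the LLM actually made in this game
    mistakes: int  # illegal or ill-formatted replies in this game
    tokens: int    # tokens generated by the LLM over the whole game

def leaderboard_row(player: str, games: list[GameRecord]) -> dict:
    total_moves = sum(g.moves for g in games) or 1  # avoid division by zero
    return {
        "Player": player,
        "Draws": 100.0 * sum(g.result == "draw" for g in games) / len(games),
        "Wins": 100.0 * sum(g.result == "win" for g in games) / len(games),
        "Mistakes": 1000.0 * sum(g.mistakes for g in games) / total_moves,
        "Tokens": sum(g.tokens for g in games) / total_moves,
    }
```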

Project's GitHub