__       __                    ____     __                               
/\ \     /\ \       /'\_/`\    /\  _``. /\ \                              
\ \ \    \ \ \     /\      \   \ \ \/\_\\ \ \___      __    ____    ____  
 \ \ \  __\ \ \  __\ \ \__\ \   \ \ \/_/_\ \  _ `\  /'__`\ /',__\  /',__\ 
  \ \ \L\ \\ \ \L\ \\ \ \_/\ \   \ \ \L\ \\ \ \ \ \/\  __//\__, `\/\__, `\
   \ \____/ \ \____/ \ \_\\ \_\   \ \____/ \ \_\ \_\ \____\/\____/\/\____/
    \/___/   \/___/   \/_/ \/_/    \/___/   \/_/\/_/\/____/\/___/  \/___/ 

        
         __         __         __    __                        
        /\ \       /\ \       /\ "-./  \                       
        \ \ \____  \ \ \____  \ \ \-./\ \                      
         \ \_____\  \ \_____\  \ \_\ \ \_\                     
          \/_____/   \/_____/   \/_/  \/_/                                                                                
 ______     __  __     ______     ______     ______    
/\  ___\   /\ \_\ \   /\  ___\   /\  ___\   /\  ___\   
\ \ \____  \ \  __ \  \ \  __\   \ \___  \  \ \___  \  
 \ \_____\  \ \_\ \_\  \ \_____\  \/\_____\  \/\_____\ 
  \/_____/   \/_/\/_/   \/_____/   \/_____/   \/_____/ 
        
Random Player (White)
♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖

GPT-4o Mini (Black)

GAME OVER

- Outcome: Draw
- Max moves reached: 200
- Material White: 16
- Material Black: 18

Can a Large Language Model play chess? You can prompt it to make a move based on the board state and hint it with a list of legal moves (asking it to pick one). You will find that an LLM can move pieces (though not every model can; some struggle with instruction following) and can even provide sound commentary on why it made a certain move and what tactic or strategy it followed.

But can LLMs actually make meaningful moves and win a chess game? Why not pit them against a random player, i.e., a bot that picks a move at random from the list of legal moves for the current position? After all, these models are called Foundation Models for a reason: they have the knowledge of the entire Internet, can (supposedly) reason, and pass numerous math evaluations and PhD-level exams. What could be easier for an LLM than scoring a victory over a chaos monkey?
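The random player itself is essentially a one-liner. Here is a hedged sketch of it plus a game loop with the 200-move draw cap; the `board` interface is hypothetical, loosely modeled on what a library like python-chess exposes:

```python
import random

MAX_MOVES = 200  # games hitting this cap are scored as draws

def random_player(legal_moves, rng=random):
    """The 'chaos monkey' baseline: pick uniformly among legal moves."""
    return rng.choice(legal_moves)

def play(board, white, black, max_moves=MAX_MOVES):
    """Alternate moves until no legal move remains (mate or stalemate)
    or the move cap is hit. `board` is an assumed interface exposing
    legal_moves(), push(move), and winner() -> 'white' | 'black' | None."""
    players = (white, black)
    for ply in range(2 * max_moves):          # two plies per full move
        moves = board.legal_moves()
        if not moves:
            return board.winner() or "draw"   # mate wins; stalemate draws
        board.push(players[ply % 2](moves))
    return "draw"                             # move cap reached
```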

Let's find out ツ

Player   Wins   Draws   Mistakes   Tokens  

METRICS:

- Player: Model playing as black against a Random Player.
- Wins: How often the player scored a win (via checkmate or the opponent failing to make a move). Shows an LLM's proficiency in chess.
- Draws: Percentage of games without a winner (e.g., reaching the 200-move cap or stalemate). Shows a weaker LLM's chess proficiency when it cannot win outright.
- Mistakes: Number of erroneous LLM replies per 1000 moves, i.e., how often the LLM failed to follow the instructions and make a move (e.g., due to hallucinations, picking illegal moves, or not conforming to the communication protocol). Shows the model's instruction-following capabilities and hallucination resistance.
- Tokens: Number of tokens generated per move. Demonstrates the model's verbosity.
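Under these definitions the metrics reduce to a few ratios. A sketch of how they could be computed per player (the field names are mine, not the project's schema):

```python
from dataclasses import dataclass

@dataclass
class PlayerStats:
    games: int        # simulations played
    wins: int         # checkmates + opponent failures to move
    draws: int        # move cap reached, stalemate, etc.
    bad_replies: int  # replies that failed to yield a legal move
    moves: int        # total moves actually made
    tokens: int       # total tokens generated

    @property
    def win_pct(self) -> float:
        return 100 * self.wins / self.games

    @property
    def draw_pct(self) -> float:
        return 100 * self.draws / self.games

    @property
    def mistakes_per_1000(self) -> float:
        return 1000 * self.bad_replies / self.moves

    @property
    def tokens_per_move(self) -> float:
        return self.tokens / self.moves
```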

NOTES:

- LLMs played as black against a Random Player (as white).
- 30+ game simulations for Random Player vs. LLM.
- Bottom rows in green demonstrate how a Chess Engine (Stockfish v17) fares against a Random Player.
- 1000 simulations for Random Player vs. Chess Engine and Random vs. Random.
- You read that right: LLMs scored 0 wins. That is no longer the case; o1-mini was the first LLM to score wins.
- Draws, rather than Wins, are used to evaluate the chess proficiency of LLMs that can't win.
- The default sorting is by Wins DESC, Draws DESC, and Mistakes ASC.
- Strong models (those that win) are judged on chess proficiency by % Wins; weak ones by % Draws.
- The Mistakes metric evaluates LLMs' instruction-following capabilities and resistance to hallucinations (making up illegal moves even when the list of legal moves is provided in the prompt).
- Sort by the Mistakes column to get a ranking of instruction-following ability (fewer mistakes is better).
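The default ordering above can be expressed as a single composite sort key (illustrative; the actual leaderboard code may differ):

```python
def leaderboard_order(rows):
    """Sort by Wins DESC, then Draws DESC, then Mistakes ASC.
    Each row is assumed to carry 'wins', 'draws', 'mistakes' fields."""
    return sorted(rows, key=lambda r: (-r["wins"], -r["draws"], r["mistakes"]))
```

Negating the descending fields keeps the whole ordering in one ascending `sorted` call instead of chaining several stable sorts.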

Project's GitHub