David Silver - Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (2017)

Created: December 6, 2017 / Updated: March 22, 2020 / Status: finished / 2 min read (~256 words)

  • Instead of an alpha-beta search with domain-specific enhancements, AlphaZero uses a general-purpose Monte-Carlo tree search (MCTS) algorithm
  • The AlphaZero algorithm described in this paper differs from the original AlphaGo Zero algorithm in several respects
    • AlphaGo Zero estimates and optimises the probability of winning, assuming binary win/loss outcomes. AlphaZero instead estimates and optimises the expected outcome, taking into account draws or potentially other outcomes (see the loss sketch after this list)
    • In AlphaGo Zero, self-play games were generated by the best player from all previous iterations. After each iteration of training, the performance of the new player was measured against the best player; if it won by a margin of 55%, it replaced the best player and self-play games were subsequently generated by this new player. In contrast, AlphaZero simply maintains a single neural network that is updated continually, rather than waiting for an iteration to complete (see the training-loop sketch after this list)
    • In AlphaZero we reuse the same hyper-parameters for all games without game-specific tuning. The sole exception is the noise that is added to the prior policy to ensure exploration; this is scaled in proportion to the typical number of legal moves for that game type (see the MCTS sketch after this list)
  • AlphaZero searches just 80 thousand positions per second in chess and 40 thousand in shogi, compared to 70 million for Stockfish and 35 million for Elmo
  • AlphaZero compensates for the lower number of evaluations by using its deep neural network to focus much more selectively on the most promising variations - arguably a more "human-like" approach to search, as originally proposed by Shannon
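
The loss sketch below illustrates the expected-outcome point: the value head is regressed towards the actual game result z ∈ {−1, 0, +1}, so draws are represented directly rather than being folded into a binary win/loss probability. The overall form (z − v)² − π·log p + c‖θ‖² matches the loss reported in the paper, but the function signature and variable names here are my own illustrative choices.

```python
import numpy as np

def alphazero_loss(z, v, pi, p_logits, theta_sq_norm, c=1e-4):
    """Sketch of the AlphaZero loss (z - v)^2 - pi . log p + c * ||theta||^2.

    z            : game outcome from the current player's view, in {-1, 0, +1};
                   keeping 0 for draws is what distinguishes this expected-outcome
                   target from AlphaGo Zero's binary win/loss formulation
    v            : scalar value prediction of the network, in [-1, 1]
    pi           : MCTS visit-count policy target (probabilities over moves)
    p_logits     : network policy logits over the same moves
    theta_sq_norm: squared L2 norm of the network parameters
    """
    log_p = p_logits - np.log(np.sum(np.exp(p_logits)))  # log-softmax over moves
    return (z - v) ** 2 - np.dot(pi, log_p) + c * theta_sq_norm
```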
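The training-loop sketch below contrasts the two update schemes; every function name (make_net, play_games, train_on, eval_match) is a placeholder for illustration, not part of either paper's published code.

```python
def alphago_zero_loop(make_net, play_games, train_on, eval_match,
                      iterations=10, gate=0.55):
    """AlphaGo Zero scheme: self-play data comes from the current best player,
    and a trained candidate only replaces it after winning an evaluation match
    by the gating margin (55% in the paper)."""
    best = make_net()
    candidate = make_net()
    for _ in range(iterations):
        data = play_games(best)                  # best player generates the games
        candidate = train_on(candidate, data)
        if eval_match(candidate, best) >= gate:  # measured win rate vs best
            best = candidate
    return best

def alphazero_loop(make_net, play_games, train_on, steps=10):
    """AlphaZero scheme: a single continually updated network; the latest
    parameters always generate the next self-play games, with no gating step."""
    net = make_net()
    for _ in range(steps):
        data = play_games(net)
        net = train_on(net, data)
    return net
```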
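The MCTS sketch below shows the PUCT selection rule and the Dirichlet noise mixed into the root priors. The paper reports fixed per-game concentration values (0.3 for chess, 0.15 for shogi, 0.03 for Go), chosen in inverse proportion to the typical number of legal moves; the scaling constant and c_puct value used here are illustrative, not the paper's settings.

```python
import math
import numpy as np

def add_root_noise(priors, num_legal_moves, eps=0.25):
    """Mix Dirichlet noise into the root priors to ensure exploration.

    The concentration parameter shrinks as the branching factor grows, mirroring
    how the paper scales the noise with the typical number of legal moves; the
    constant 10.0 is an illustrative stand-in, not the published value.
    """
    priors = np.asarray(priors, dtype=float)
    alpha = 10.0 / num_legal_moves
    noise = np.random.dirichlet([alpha] * len(priors))
    return (1 - eps) * priors + eps * noise

def puct_score(parent_visits, child_visits, child_total_value, prior, c_puct=1.25):
    """PUCT selection score Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a)),
    which lets the learned prior focus search on the most promising variations."""
    q = child_total_value / child_visits if child_visits > 0 else 0.0
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u
```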