Oriol Vinyals - StarCraft II: A New Challenge for Reinforcement Learning (2017)

Created: October 5, 2017 / Updated: February 6, 2021 / Status: finished / 3 min read (~479 words)

  • A new reinforcement learning environment has been built for StarCraft II
    • It exposes the SC2 API together with a Python interface (PySC2) that makes it easier to develop agents (a minimal usage sketch follows this list)
  • Some basic agents have been developed which can solve mini-games
    • Mini-games can be seen as a form of curriculum learning for the full game (learning specific behaviors such as mining, attacking, and moving the camera and units)
  • Those basic agents are still not very good at playing the complete game in a 1v1 setting
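
The following is a minimal sketch of the agent/environment loop through PySC2. The map name, feature resolutions, and `step_mul` are illustrative choices, and the exact constructor arguments vary between PySC2 releases; the random agent mirrors the baseline that ships with the library.

```python
import numpy as np
from pysc2.agents import base_agent
from pysc2.env import sc2_env
from pysc2.lib import actions, features

class RandomAgent(base_agent.BaseAgent):
    """Picks a uniformly random legal action every step."""

    def step(self, obs):
        super().step(obs)
        # Only the functions listed as available this step are legal.
        function_id = np.random.choice(obs.observation.available_actions)
        args = [[np.random.randint(0, size) for size in arg.sizes]
                for arg in self.action_spec.functions[function_id].args]
        return actions.FunctionCall(function_id, args)

def main():
    agent = RandomAgent()
    # "MoveToBeacon" is one of the released mini-games.
    with sc2_env.SC2Env(
            map_name="MoveToBeacon",
            players=[sc2_env.Agent(sc2_env.Race.terran)],
            agent_interface_format=features.AgentInterfaceFormat(
                feature_dimensions=features.Dimensions(screen=84, minimap=64)),
            step_mul=8) as env:
        agent.setup(env.observation_spec()[0], env.action_spec()[0])
        timesteps = env.reset()
        agent.reset()
        while not timesteps[0].last():
            timesteps = env.step([agent.step(timesteps[0])])

if __name__ == "__main__":
    main()
```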

  • It is a multi-agent problem
  • It is an imperfect information game
  • The action space is vast and diverse
    • The set of legal actions varies as the player progresses through a tree of possible technologies
  • Games typically last for many thousands of frames and actions, and the player must make early decisions with consequences that may not be seen until much later in the game, leading to a rich set of challenges in temporal credit assignment and exploration
  • Games offer multiple advantages:
    • They have clear objective measures of success
    • Computer games typically output rich streams of observational data, which are ideal inputs for deep networks
    • They are externally defined to be difficult and interesting for a human to play
    • Games are designed to be run anywhere with the same interface and game dynamics, making it easy to share a challenge precisely with other researchers
    • In some cases a pool of avid human players exists, making it possible to benchmark against highly skilled individuals
    • Since games are simulations, they can be controlled precisely, and run at scale

  • Two different reward structures are defined (a sketch of reading both signals follows this list):
    • Ternary: +1 (win) / 0 (tie) / -1 (loss), received at the end of a game
    • Blizzard score
      • It is computed as the sum of current resources and upgrades researched, as well as units and buildings currently alive and being built
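
A hedged sketch of reading both signals from a PySC2 timestep: the `score_cumulative` field is part of the PySC2 observation (index 0 holds the overall score), while using the per-step score delta as a shaped reward is an illustrative choice, not something the paper prescribes.

```python
def outcome_reward(timestep):
    # Ternary reward: with the default win/loss reward, PySC2 reports
    # +1/0/-1 in timestep.reward at the end of an episode (0 in between).
    return timestep.reward

def blizzard_score(timestep):
    # Blizzard score: resources + upgrades researched + units and
    # buildings currently alive or in production (index 0 = overall score).
    return int(timestep.observation["score_cumulative"][0])

def score_delta_reward(prev_timestep, timestep):
    # Illustrative shaping (an assumption, not from the paper): reward
    # the per-step change in Blizzard score instead of its absolute value.
    return blizzard_score(timestep) - blizzard_score(prev_timestep)
```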

  • The main observations come as sets of feature layers which are rendered at $N \times M$ pixels
  • Each of these layers represents something specific in the game, for example: unit type, hit points, owner, or visibility
    • Some of these are scalar, while others are categorical (see the preprocessing sketch after this list)
  • There are two sets of feature layers:
    • The minimap is a coarse representation of the state of the entire world
    • The screen is a detailed view of a subsection of the world corresponding to the player's on-screen view
  • Feature layers are rendered via a camera that uses a top-down orthographic projection
    • This means the feature-layer rendering does not quite match what a human would see
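
Because scalar and categorical layers need different treatment, a preprocessing step usually sits between the raw observation and the network. Below is a minimal sketch: one-hot encoding for categorical layers plus log scaling for scalars is a common choice, whereas the paper embeds categorical layers into a continuous space (which a learned embedding would replace here).

```python
import numpy as np
from pysc2.lib import features

def preprocess_screen(feature_screen):
    """Stack the screen feature layers into one float tensor of shape [C, H, W]."""
    out = []
    for spec in features.SCREEN_FEATURES:
        layer = np.asarray(feature_screen[spec.index])
        if spec.type == features.FeatureType.CATEGORICAL:
            # One-hot along a new channel axis; large layers such as
            # unit_type are better served by a learned embedding.
            onehot = np.eye(spec.scale, dtype=np.float32)[layer.astype(np.int32)]
            out.append(onehot.transpose(2, 0, 1))  # [H, W, scale] -> [scale, H, W]
        else:
            # Log scaling tames wide-ranging scalars such as hit points.
            out.append(np.log1p(layer.astype(np.float32))[None])
    return np.concatenate(out, axis=0)
```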

  • An action $a$ is represented as a composition of a function identifier $a^0$ and a sequence of arguments which that function identifier requires: $a^1, a^2, \dots, a^L$
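
As an illustration, a screen-targeted move decomposes into a function identifier plus its argument list. The function and argument layout below follow PySC2's released action spec; the coordinates are arbitrary.

```python
from pysc2.lib import actions

# a^0: the function identifier, here Move_screen.
# a^1: the "queued" flag; a^2: the (x, y) screen coordinate.
move = actions.FunctionCall(
    actions.FUNCTIONS.Move_screen.id,  # a^0
    [[0],                              # a^1: queued = False
     [23, 38]])                        # a^2: screen coordinates

# Equivalent convenience form in recent PySC2 releases:
move = actions.FUNCTIONS.Move_screen("now", (23, 38))
```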

  • Vinyals, Oriol, et al. "StarCraft II: A New Challenge for Reinforcement Learning." arXiv preprint arXiv:1708.04782 (2017).