Ronen Brafman - R-max – A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning (2002)

Created: December 8, 2017 / Updated: March 22, 2020 / Status: finished / 3 min read (~493 words)

  • Why do they say "Since there is only a polynomial number of parameters to learn"? What are those parameters?

  • R-max improves upon several previous algorithms
    • It is simpler and more general than Kearns and Singh's $E^3$ algorithm, covering zero-sum stochastic games
    • It has a built-in mechanism for resolving the exploration vs exploitation dilemma
    • It formally justifies the "optimism under uncertainty" bias used in many RL algorithms
    • It is simpler, more general, and more efficient than Brafman and Tennenholtz's LSG algorithm for learning in single controller stochastic games
    • It generalizes the algorithm by Monderer and Tennenholtz for learning in repeated games
  • It is, to date, the only provably efficient algorithm for learning in repeated games, considerably improving and simplifying previous algorithms by Banos and by Megiddo

  • The major insight behind the R-max algorithm is that the policy that is optimal with respect to the agent's current (fictitious, optimistic) model is, with respect to the real model, always either near-optimal or guaranteed to lead to efficient learning, i.e., to new information about the real model
  • The approach taken by R-max is not new. It has been referred to as the "optimism in the face of uncertainty" heuristic (see the sketch below)
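A minimal Python sketch of the optimistic model initialization this bias corresponds to, using illustrative names rather than the paper's pseudocode: every state-action pair the agent has not yet learned about is assumed to pay $R_{max}$ and to lead to a fictitious absorbing state that itself always pays $R_{max}$, so planning in this model favors either genuinely good actions or unexplored ones.

```python
# Sketch of R-max's optimistic initial model (names are illustrative, not the paper's pseudocode).
R_MAX = 1.0            # assumed known upper bound on the reward
FICTITIOUS = "G0"      # fictitious absorbing state in which every action pays R_MAX

def initial_model(states, actions):
    """Build the agent's initial, fictitious model.

    Every state-action pair is optimistically assumed to pay R_MAX and to lead
    deterministically to the fictitious absorbing state, which itself pays R_MAX
    forever.  Planning in this model therefore values unexplored pairs as highly
    as possible ("optimism in the face of uncertainty")."""
    model = {}
    for s in list(states) + [FICTITIOUS]:
        for a in actions:
            model[(s, a)] = {
                "reward": R_MAX,
                "next": {FICTITIOUS: 1.0},  # transition to the fictitious state w.p. 1
                "known": False,
            }
    return model

# Example usage with made-up states and actions
model = initial_model(states=["s1", "s2"], actions=["a1", "a2"])
```

A policy computed as optimal in this model either exploits parts of the model that already match reality or steers the agent toward state-action pairs whose true dynamics it has not yet observed.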

  • A game is a model of multi-agent interaction
  • In a game, we have a set of players, each of whom chooses some action to perform from a given set of actions
  • A common description of a game is as a matrix. This is called a game in strategic form
    • The rows of the matrix correspond to the agent's actions and the columns correspond to the adversary's actions. The entry in row $i$ and column $j$ of the game matrix contains the rewards obtained by the agent and the adversary if the agent plays its $i^{th}$ action and the adversary plays its $j^{th}$ action (see the example after this list)
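As a concrete, made-up illustration, a small zero-sum game in strategic form can be stored as a matrix of reward pairs indexed by the two players' action indices:

```python
# A small, made-up 2x2 game in strategic form.
# Rows: the agent's actions; columns: the adversary's actions.
# Each entry holds (agent's reward, adversary's reward); in a zero-sum game they sum to zero.
game_matrix = [
    [(1.0, -1.0), (-2.0, 2.0)],   # agent plays its 1st action
    [(0.5, -0.5), (0.0, 0.0)],    # agent plays its 2nd action
]

def payoffs(i, j):
    """Rewards when the agent plays its i-th action and the adversary its j-th action (0-indexed)."""
    return game_matrix[i][j]

print(payoffs(0, 1))  # -> (-2.0, 2.0)
```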

  • We make two central assumptions:
    • We assume that the agent always recognizes the identity of the state it reached and that after playing a game, it knows what actions were taken by its adversary and what payoffs were obtained
    • We assume that the maximal possible reward $R_{max}$ is known ahead of time
  • We start learning at some state $s$ in which the optimal action is $a$
    • If we do not execute the action $a$ in $s$, we reach some state $s'$ that has a lower value
    • A learning algorithm without any prior knowledge cannot be expected to immediately guess that $a$ should be done in $s$
    • Without such prior knowledge, it cannot conclude that $a$ is a good action unless it tries the other actions in $s$ and compares their outcome to that of $a$
    • One can expect an agent to learn a near-optimal policy only if it can visit state $s$ sufficiently many times to learn about the consequences of the different options in $s$ (a bookkeeping sketch follows this list)
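A minimal Python sketch, with illustrative names, of the bookkeeping these notes imply: because the agent observes the state it is in, the adversary's action, the payoffs, and the resulting state (the first assumption above), it can count visits to each joint action, and an entry of a stage game only becomes "known", i.e., its empirical estimates replace the optimistic defaults, after it has been sampled enough times. The threshold $K_1$ is left as a parameter; its polynomial value is derived in the paper.

```python
from collections import defaultdict

# Visit counts, reward sums, and transition counts per (state, agent action, adversary action).
# These are all the statistics the agent needs, given that it observes the state it is in,
# the adversary's action, the payoffs obtained, and the state it reaches (first assumption above).
visit_counts = defaultdict(int)
reward_sums = defaultdict(float)
transition_counts = defaultdict(lambda: defaultdict(int))

def record(state, agent_action, adversary_action, reward, next_state):
    """Record one observed stage-game outcome."""
    key = (state, agent_action, adversary_action)
    visit_counts[key] += 1
    reward_sums[key] += reward
    transition_counts[key][next_state] += 1

def entry_is_known(state, agent_action, adversary_action, K1):
    """An entry of a stage game is treated as 'known' once its joint action has been
    played K1 times; only then do its empirical reward and transition estimates
    replace the optimistic R_max defaults in the agent's model."""
    return visit_counts[(state, agent_action, adversary_action)] >= K1
```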

  • Brafman, Ronen I., and Moshe Tennenholtz. "R-max-a general polynomial time algorithm for near-optimal reinforcement learning." Journal of Machine Learning Research 3.Oct (2002): 213-231.