The book Reinforcement Learning: An Introduction was written by Richard S. Sutton and Andrew G. Barto; the first edition appeared in 1998, and a complete draft of the second edition came out in 2017.
1. Definition & Problem
1.1 Definition
1.1.1 Simply speaking
- Reinforcement learning is a kind of machine learning method that is goal-directed and learns from interaction
- RL is about how to map situations to actions so as to maximize a numerical reward signal
1.1.2 Formulation (MDP)
- Markov Decision Process (MDP)
- In most RL cases, the state and action spaces are finite, in which case the MDP is called a finite MDP
- to be continued
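Until those notes are filled in, here is the core definition in the second edition's notation: the dynamics of a finite MDP are fully specified by the four-argument function

\[
p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}
\]

which gives the probability of moving to state \(s'\) and receiving reward \(r\) after taking action \(a\) in state \(s\).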
1.2 Characteristics
- RL problems are:
- closed-loop: the learner's actions are chosen based on its inputs, and those actions in turn influence its later inputs
- no direct instruction: the learner is not told which actions to take; it must try them to discover which yield the most reward
- delayed reward: actions may affect not only the immediate reward but also all subsequent rewards, i.e. the return \(G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots\)
- The MDP framing implies that the learning agent must:
- be able to sense the state of its environment
- be able to take actions that affect that state
- have a goal or goals related to the state
- RL differs from:
- supervised learning: in supervised learning, each training sample (situation) is labeled with the correct action to take, whereas in RL no such “correct instruction of actions” exists
- unsupervised learning: RL aims to maximize a reward signal rather than to find hidden structure in unlabeled data
- These differences make RL a third paradigm of machine learning, rather than a part of supervised or unsupervised learning
1.3 Challenges and Advantages
- trade-off between exploitation (choosing actions already known to yield high reward, i.e. acting greedily) and exploration (trying unknown actions that may turn out better); see the ε-greedy sketch after this list
- RL considers the whole problem of a goal-directed agent interacting with an uncertain environment, instead of splitting it into subproblems and studying them in isolation
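As a concrete illustration of the exploration-exploitation trade-off, here is a minimal ε-greedy action selector in Python. The function name and the default epsilon are illustrative assumptions, not the book's pseudocode.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Choose an action index given a list of estimated action values.

    With probability epsilon, explore: pick a random action.
    Otherwise, exploit: pick a greedy (highest-value) action,
    breaking ties at random.
    """
    if random.random() < epsilon:  # explore
        return random.randrange(len(q_values))
    best = max(q_values)           # exploit
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return random.choice(best_actions)

# Example: mostly picks action 1 (highest estimate), occasionally explores.
print(epsilon_greedy([0.2, 1.5, 0.3], epsilon=0.1))
```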
1.4 Examples
- A master chess player
- An adaptive controller
- Learning how to run
- A robot decides whether to enter a room to collect trash
- Preparing breakfast
All of these cases involve interaction between a decision-making agent and its environment, in which the agent seeks to achieve a goal despite uncertainty about the environment
1.5 Elements of RL
- policy: a mapping from perceived states of the environment to the actions the agent takes
- reward: defines the goal. Each state (or state-action pair) yields a single number, and the total sum of rewards is what the agent should maximize
- value function: the total reward an agent can expect to accumulate starting from a state, indicating its long-term goodness
- (optional) model of the environment: something that infers how the environment will behave, used for planning (the sketch after this list shows how the elements fit together)
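To tie the four elements together, below is a minimal agent-environment interaction loop. The `env` and `agent` interfaces here are hypothetical assumptions for illustration, not an API from the book.

```python
def run_episode(env, agent):
    """Run one episode of agent-environment interaction.

    - agent.policy(state) implements the policy (state -> action).
    - env.step(action) returns the next state, the reward signal,
      and whether the episode has ended.
    - agent.update(...) can adjust value estimates (and a learned
      model, if the agent keeps one) from the observed transition.
    """
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent.policy(state)
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state)
        total_reward += reward
        state = next_state
    return total_reward
```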
2. Methods
2.1 Tabular Solution Methods
The simplest forms of RL: the state and action spaces are small enough for the approximate value functions to be represented as tables (arrays).
- Dynamic Programming
- Monte Carlo methods
- Temporal-difference learning (TD)
- Combinations of the above three (see the tabular Q-learning sketch after this list for a concrete TD example)
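As one concrete tabular method, here is a sketch of one-step Q-learning, a TD method. The update rule is the standard Q-learning rule; the environment interface (`reset`, `step`, `n_actions`) is a hypothetical assumption.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular one-step Q-learning.

    Q is literally a table: one row of action values per state.
    TD update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Assumes env.reset() -> state, env.step(a) -> (next_state, reward, done),
    and env.n_actions (hypothetical interface).
    """
    Q = defaultdict(lambda: [0.0] * env.n_actions)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            if random.random() < epsilon:
                action = random.randrange(env.n_actions)
            else:
                action = max(range(env.n_actions), key=lambda a: Q[state][a])
            next_state, reward, done = env.step(action)
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```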
2.2 Approximate Solution Methods
- When the state or action space is enormous, we cannot expect to find an optimal policy or value function; instead, we look for a good approximate solution (see the sketch below)
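One common approach is to represent the value function with a parameterized function and adjust its weights by gradient methods, so that learning generalizes across states. Below is a minimal sketch of semi-gradient TD(0) with a linear value function; the feature function `phi` and the environment interface are hypothetical assumptions.

```python
import numpy as np

def semi_gradient_td0(env, phi, n_features, episodes=200, alpha=0.01, gamma=0.99):
    """Estimate state values as v(s) ~ w . phi(s) under a fixed policy.

    Instead of one table entry per state, a weight vector of size
    n_features generalizes across an enormous state space.
    Assumes env.reset() -> state, env.step(a) -> (next_state, reward, done),
    env.sample_action() drawing actions from the policy being evaluated
    (hypothetical interface), and phi(state) returning an n_features vector.
    """
    w = np.zeros(n_features)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = env.sample_action()
            next_state, reward, done = env.step(action)
            v_next = 0.0 if done else w @ phi(next_state)
            td_error = reward + gamma * v_next - w @ phi(state)
            w += alpha * td_error * phi(state)  # semi-gradient TD(0) update
            state = next_state
    return w
```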