Reinforcement Learning

The book Reinforcement Learning: An Introduction is written by Richard S. Sutton and Andrew G. Barto; the first edition was published in 1998, and the complete draft of the second edition came out in 2017.

1. Definition & Problem

1.1 Definition

1.1.1 Simply speaking

  • Reinforcement learning is a kind of machine learning that is goal-directed and learns from interaction
  • RL is about how to map situations to actions so as to maximize a reward signal

1.1.2 Formulation (MDP)

  • Markov Decision Process
  • For most RL cases, the state and action spaces are finite, and such an MDP is called a finite MDP
  • to be continued; a minimal sketch of the formulation is given below
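
A minimal sketch of the finite-MDP formulation, in the notation used in the book: at each time step \(t\) the agent observes a state \(S_t \in \mathcal{S}\), selects an action \(A_t \in \mathcal{A}(S_t)\), and then receives a reward \(R_{t+1} \in \mathcal{R}\) and the next state \(S_{t+1}\). The dynamics of a finite MDP are fully specified by

\[
p(s', r \mid s, a) \doteq \Pr\{S_{t+1} = s',\, R_{t+1} = r \mid S_t = s,\, A_t = a\}.
\]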

1.2 Characteristics

  • RL problems are:
    • closed-loop: the action is chosen based on the current input, and the action in turn influences the next input
    • no direct instruction: the learner is not told which actions to take; it must try them to discover which yield the most reward
    • delayed reward: actions may affect not only the immediate reward but also all subsequent rewards
  • MDP suggests that the learning machine must:
    • be able to sense the state
    • be able to take actions that affect the state
    • have a goal or goals related to the state
  • RL differs from:
    • supervised learning: in supervised learning each sample (situation) is labeled with a specification of the correct action to take, while in RL no such correct instruction exists
    • unsupervised learning: RL aims to maximize a reward signal rather than to find hidden structure in unlabeled data
  • These differences make RL a third paradigm of machine learning, rather than a part of supervised or unsupervised learning

1.3 Challenges and Pros

  • trade-off between exploitation (choosing actions already known to yield good reward, i.e. acting greedily) and exploration (trying unknown actions to discover better ones); see the ε-greedy sketch after this list
  • RL considers the whole problem of a goal-directed agent interacting with an uncertain environment, instead of splitting it into subproblems and studying them in isolation
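
As one common illustration of how this trade-off can be handled, below is a minimal ε-greedy action-selection sketch in Python; the function name and the `q_estimates` array of estimated action values are my own assumptions for illustration, not something taken from the book.

```python
import numpy as np

def epsilon_greedy(q_estimates, epsilon, rng=None):
    """Pick the greedy action with probability 1 - epsilon, otherwise explore.

    q_estimates: 1-D array of estimated action values, one entry per action.
    """
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < epsilon:
        # explore: choose an action uniformly at random
        return int(rng.integers(len(q_estimates)))
    # exploit: choose the action with the highest current value estimate
    return int(np.argmax(q_estimates))
```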

1.4 Examples

  • A master chess player
  • An adaptive controller
  • Learning how to run
  • A robot decides whether to enter a room to collect trash
  • Preparing breakfast

These cases share the feature that an agent interacts with its environment in order to achieve a goal despite uncertainty about that environment.

1.5 Elements of RL

  • policy: defines the agent's way of behaving, a mapping from perceived states of the environment to actions
  • reward: defines the goal; at each step the environment sends a single number, and the agent's objective is to maximize the total reward over time
  • value function: the expected total reward accumulated from a state onward, indicating its long-term goodness (see the sketch after this list)
  • (optional) model of the environment: something that mimics how the environment will behave, used for planning
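
A minimal sketch of the relation between reward and value (standard notation with a discount factor \(\gamma \in [0, 1)\)): the return is the discounted sum of future rewards, and the value of a state under a policy \(\pi\) is its expected return,

\[
G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad
v_\pi(s) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s \right].
\]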

2. Methods

2.1 Tabular Solution

The simplest form of RL: the state and action spaces are small enough for the value functions to be represented as tables (arrays); a small example of a tabular update is sketched after the list below.

  • Dynamic Programming
  • Monte Carlo methods
  • Temporal-difference learning (TD)
  • Combinations of the above three
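
As a concrete example of a tabular temporal-difference method, below is a minimal sketch of a one-step Q-learning update on a table of action values. The variable names and the dictionary-of-arrays representation of the table are my own assumptions for illustration, not code from the book.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, terminal, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q[s][a] toward the TD target.

    Q: dict mapping each state to a 1-D array of action-value estimates.
    """
    # bootstrapped TD target; no bootstrap term if s_next is terminal
    td_target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q
```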

2.2 Approximate Solution

  • When the state or action space is enormous, we cannot expect to find an optimal policy or value function; instead, we look for a good approximate solution, as sketched below
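
One common way to obtain such an approximate solution is to parameterize the value function instead of storing it in a table. Below is a minimal sketch of a semi-gradient TD(0) update for a linear value function \( \hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) \); the feature vectors and all names here are assumptions for illustration.

```python
import numpy as np

def semi_gradient_td0(w, x_s, x_s_next, r, terminal, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) step for a linear value function v(s) = w . x(s).

    w: weight vector; x_s, x_s_next: feature vectors of current and next state.
    """
    v_next = 0.0 if terminal else w @ x_s_next
    td_error = r + gamma * v_next - w @ x_s   # TD error with bootstrapped target
    return w + alpha * td_error * x_s         # gradient of v(s) w.r.t. w is x(s)
```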