
DQN and beyond

Q Learning that actually works :)


The main goal of this part is to understand how DQN and its variants solve the key challenges in fitted-Q iteration. Note how seemingly small changes to fitted-Q, such as replay buffers and target networks, play just as important a role as deep neural nets in making DQN work successfully.

  • Replay Buffer: Avoids Correlated Samples. If we look at the online Q-learning algorithm,

    we are updating the Q-function after every executed action, which means the samples collected in this fashion will be highly correlated, and the algorithm will locally overfit to the Q-values in its current "neighborhood", as illustrated below:

    so that towards the end of these longer trajectories, the Q-values for earlier states become inaccurate. Parallel workers are an option but only partially solve this problem: with synchronized or asynchronous workers, the Q-values are better averaged out across parallel environments at each timestep, but this still doesn't help "remember" Q-estimates for states from earlier in each worker's trajectory. A replay buffer, on the other hand, is a simple idea that breaks this sample correlation: put every experienced transition into one buffer and sample from it uniformly (or more smartly, with prioritization). Another advantage is that samples in the buffer don't have to be on-policy, since we are only evaluating state-action pairs. (A minimal replay-buffer sketch is given right after this list.)

  • Target Network: Avoids Moving Target. If we compare the Q-function update step in Q-learning and fitted-Q:

    in fitted-Q, the gradient update for Q is a regression problem minimizing an l2 loss, which is stable: we are estimating a Q-value for every (s, a) under a fixed policy. In Q-learning with a replay buffer, the samples are de-correlated (they come from the buffer), but we are using the Q-function to update itself, i.e. a "moving target", which is unstable and might not converge.

    Using a target network that only gets updated periodically helps tackle this problem. Note how "classic" deep-Q is the special case of this general recipe with N = K = 1. (A sketch of the resulting update is given after this list.)
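As a rough illustration of the replay-buffer idea above (not the lecture's code; the capacity and transition layout are my own choices), a minimal uniform buffer can be as simple as:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform replay buffer: store transitions, sample them i.i.d."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions get evicted first

    def add(self, s, a, r, s_next, done):
        # Transitions may come from old policies: Q-learning is off-policy,
        # since we only need (s, a, r, s') tuples to evaluate state-action pairs.
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between consecutive
        # steps; prioritized replay would instead weight samples by TD error.
        return random.sample(self.buffer, batch_size)
```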
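And a sketch of how the target network enters the regression target (PyTorch-style pseudocode; `q_net`, `target_net`, and the batch layout are assumed names, not from the lecture):

```python
import torch
import torch.nn.functional as F

def q_update(q_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    # The regression target uses the *frozen* target network, so the target
    # does not move while we take gradient steps on q_net.
    with torch.no_grad():
        max_q_next = target_net(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * max_q_next
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)  # plain l2 regression, just like fitted-Q
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every N gradient steps, copy the online weights into the target network
# (or use Polyak averaging):  target_net.load_state_dict(q_net.state_dict())
```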

Major Improvements on DQN

Soon after the seminal DQN paper came out, many follow-up works on further improving the algorithm were introduced. Below are some of the most commonly used ones:

Double-DQN

In implementation, double-Q is just a one-line change in the loss function inside the original DQN's gradient update step. But to understand why it matters and why it consistently outperforms DQN, it helps to first see why DQN tends to over-estimate Q-values:

I find the simple general case with two random variables X_1, X_2 pretty intuitive: E[max(X_1, X_2)] ≥ max(E[X_1], E[X_2]), so taking a max over noisy estimates is biased upwards. Because both the value estimate and the argmax action come from the same (current) Q-function network, the target value easily gets over-estimated. With Double-Q, we use the current Q to select the action and the target Q to evaluate its value, so we use two Q's that are "noisy in different ways", which largely removes the over-estimation.
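In code, the "one-line change" amounts to selecting the argmax with the online network but evaluating it with the target network; a sketch in the same assumed notation as the update above:

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        # Standard DQN: target_net both selects and evaluates the next action.
        #   y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
        # Double DQN: select the action with the online net, evaluate it with the target net.
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)
        return r + gamma * (1.0 - done) * target_net(s_next).gather(1, best_a).squeeze(1)
```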

Multi-step return

Instead of bootstrapping after a single step, we can sum the discounted rewards over N steps before bootstrapping with the Q-function, as in the Rainbow DQN paper. This is only exactly correct with on-policy data, but it usually helps in practice.
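For reference, a standard form of the N-step target (my notation; $\phi'$ denotes the target-network parameters):

$$
y_t \;=\; \sum_{k=0}^{N-1} \gamma^{k}\, r_{t+k} \;+\; \gamma^{N} \max_{a'} Q_{\phi'}\bigl(s_{t+N},\, a'\bigr)
$$

Setting N = 1 recovers the usual one-step DQN target.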

Q-learning with Continuous Actions

The classic Q-learning algorithms above always select exactly one argmax action for the Q-function. But what if we have a continuous action space and can't do this discrete action selection? There are a few ways to handle this inner maximization in Q-learning:

Stochastic optimization: approximate the inner max over actions with a sampling-based optimizer such as the cross-entropy method (CEM) or CMA-ES.
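A toy sketch of the simplest version of this option (pure random search over actions in [-1, 1]; the Q-network signature `q_net(s, a)` is an assumption, and CEM/CMA-ES would iteratively refit the sampling distribution instead of sampling once):

```python
import torch

def sampled_max_q(q_net, s, action_dim, num_samples=64):
    # s: a single state of shape [1, obs_dim]
    # Sample candidate actions and keep the best one under the current Q-function.
    candidates = torch.rand(num_samples, action_dim) * 2.0 - 1.0
    q_values = q_net(s.expand(num_samples, -1), candidates).squeeze(-1)
    best = q_values.argmax()
    return candidates[best], q_values[best]
```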

Normalized Advantage Functions (NAF): use a function class whose argmax over actions is available in closed form (Q is quadratic in the action).
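Concretely (my notation, assuming $P_\phi(s)$ is a positive-definite matrix output by the network), the NAF parameterization makes the maximization trivial:

$$
Q_\phi(s, a) = V_\phi(s) - \tfrac{1}{2}\bigl(a - \mu_\phi(s)\bigr)^{\top} P_\phi(s)\bigl(a - \mu_\phi(s)\bigr),
\qquad
\arg\max_a Q_\phi(s, a) = \mu_\phi(s), \quad \max_a Q_\phi(s, a) = V_\phi(s)
$$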

DDPG: learn an approximate maximizer.

The idea is to train a second network that takes in a state and outputs one action that approximates the true Q-value-maximizing action. We can then train this network with the current Q-function, as sketched below:
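A minimal sketch of that actor update (PyTorch-style; `mu_net`, `q_net`, and the optimizer are assumed names, and the critic update is the usual Q regression):

```python
def actor_update(mu_net, q_net, actor_optimizer, s):
    # Train the "approximate maximizer": pick actions the current Q-function
    # scores highly, by following dQ/da * da/dtheta (chain rule through Q).
    a = mu_net(s)
    actor_loss = -q_net(s, a).mean()  # maximizing Q  <=>  minimizing -Q
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

# The corresponding critic target uses target copies of both networks:
#   y = r + gamma * target_q(s_next, target_mu(s_next))
```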

All included screenshots credit to Lecture 8 (slides).