DQN and beyond

Q Learning that actually works :)

The main goal of this part is to understand how DQN and its variants solve the key challenges in fitted-Q iteration. Note how seemingly small additions to fitted-Q, such as replay buffers and target networks, play just as important a role as deep neural nets in making DQN work.

  • Replay Buffer: Breaks Sample Correlation. In online Q-learning, consecutive transitions come from the same trajectory and are strongly correlated, so that towards the end of these longer trajectories, the Q-values for earlier states become inaccurate. Parallel workers are an option but only partially solve this problem: by adding synchronized or asynchronous workers, the Q-values get averaged out across parallel environments at each timestep, but this still doesn't help "remember" Q-estimates for states from earlier in each worker's trajectory. A replay buffer, on the other hand, is a simple idea that breaks this sample correlation: put all experienced transitions in one buffer and sample from it uniformly (or more smartly, with prioritization) at each gradient step. Another advantage is that samples in the buffer don't have to be on-policy, since we are only evaluating state-action pairs (see the sketch after this list).

  • Target Network: Avoids the Moving Target. If we compare the Q-function update step in Q-learning and fitted-Q:

in fitted-Q (below), the gradient update for Q is a regression problem minimizing an $\ell_2$ loss, which is stable for estimating a Q-value of every $(s,a)$ under that policy; but in Q-learning (above), even though the samples are de-correlated (coming from the buffer), we are using the Q-function to update itself, i.e. regressing onto a "moving target", which is unstable and might not converge.

Using a target network that only gets updated periodically helps tackle this problem. Note how the "classic" deep Q-learning algorithm corresponds to setting $N = K = 1$ in this general recipe; both pieces are sketched below.
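To make the replay-buffer idea concrete, here is a minimal sketch; the class name, transition layout, and capacity are my own placeholder choices, not anything prescribed by the lecture.

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions and sample them uniformly, breaking temporal correlation."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped when full

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling; prioritized replay would instead weight by TD error.
        idxs = random.sample(range(len(self.buffer)), batch_size)
        states, actions, rewards, next_states, dones = zip(*(self.buffer[i] for i in idxs))
        return states, actions, rewards, next_states, dones
```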
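And a sketch of the corresponding gradient step with a periodically updated target network, in PyTorch-style code; `q_net`, `target_net`, `optimizer`, and the batch tensors are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch  # already converted to tensors

    # The regression target comes from the *frozen* target network, so it does not
    # move every time we take a gradient step on q_net.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    # Current Q-values of the actions that were actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every K gradient steps, sync the target network:
#   target_net.load_state_dict(q_net.state_dict())
```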

Major Improvements on DQN

Soon after the seminal DQN paper came out, many follow-up works further improving this algorithm were introduced. Below are some of the most commonly used ones:

Double-DQN

In implementation, double-Q is just a one-line change in the loss function inside the original DQN's gradient update step. But to understand why it's important and consistently outperforms DQN, it helps to first see why DQN tends to over-estimate Q-values:

I find this simple general case with random variables $X_1, X_2$ pretty intuitive: since $E[\max(X_1, X_2)] \geq \max(E[X_1], E[X_2])$, taking a max over noisy estimates is biased upwards. Because both the Q-value and the argmax action come from the same current Q-network, the target value easily gets over-estimated. But by using Double-Q:

we use the current network to select the argmax action and the target network to evaluate its value. The two Q's are "noisy in different ways", so the selection noise and the evaluation noise no longer compound, which largely removes the over-estimation.
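To make the "one-line change" explicit, here is a sketch of the two target computations side by side (PyTorch-style; `q_net`, `target_net`, and the batch tensors are placeholder assumptions):

```python
import torch

with torch.no_grad():
    # Vanilla DQN target: the target network both selects and evaluates the action,
    # so the same estimation noise gets maximized over and then used as the target.
    dqn_target = rewards + gamma * (1.0 - dones) * target_net(next_states).max(dim=1).values

    # Double DQN target: the online network selects the action, the target network
    # evaluates it -- two estimates that are "noisy in different ways".
    best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
    double_target = rewards + gamma * (1.0 - dones) * target_net(next_states).gather(1, best_actions).squeeze(1)
```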

Multi-step return

Instead of bootstrapping after a single step, the $N$-step target sums the next $N$ rewards before cutting over to the Q-estimate:

$$y_{j,t} = \sum_{t'=t}^{t+N-1} \gamma^{t'-t}\, r_{j,t'} + \gamma^{N} \max_{a'} Q_{\phi'}(s_{j,t+N}, a')$$

This gives less-biased targets when the Q-function is still inaccurate and propagates reward information faster, at the cost of only being strictly correct when the $N$ actions were collected on-policy.
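A small sketch of how such a target could be computed from a stored length-$N$ segment (pure Python; all names are placeholders):

```python
def n_step_target(rewards, bootstrap_value, episode_ended, gamma=0.99):
    """rewards: the N rewards r_t, ..., r_{t+N-1} along the stored segment.
    bootstrap_value: max_a' Q_target(s_{t+N}, a'), ignored if the episode
    ended inside the segment."""
    n = len(rewards)
    target = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if not episode_ended:
        target += (gamma ** n) * bootstrap_value
    return target
```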

Q-learning with Continuous Actions

The classic Q-learning algorithms above always select a single $\text{argmax}$ action over the Q-function, but what if we have a continuous action space and can't enumerate actions for this discrete selection? There are a few ways to handle continuous actions in Q-learning:

Optimization: approximate the max over actions with a derivative-free optimizer such as the cross-entropy method (CEM) or CMA-ES, which sample candidate actions and iteratively refine the sampling distribution (sketched below).
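A minimal sketch of CEM used to approximately maximize $Q(s,a)$ over a continuous action; `q_value` and all hyperparameters here are placeholder assumptions.

```python
import numpy as np

def cem_argmax_q(q_value, state, action_dim, iters=5, pop=64, elite_frac=0.1):
    """Approximate argmax_a Q(state, a) by repeatedly refitting a Gaussian
    to the highest-scoring sampled actions."""
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        actions = np.random.randn(pop, action_dim) * std + mean
        scores = np.array([q_value(state, a) for a in actions])
        elite = actions[np.argsort(scores)[-n_elite:]]  # keep the best candidates
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean
```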

Normalized Advantage Functions (NAF): use a Q-function class whose maximum over actions is available in closed form, so no inner optimization is needed.
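For reference, the quadratic form used by NAF looks roughly like the following (my recollection of the NAF parameterization; notation may differ from the slides):

$$Q_\phi(s,a) = -\tfrac{1}{2}\,\big(a - \mu_\phi(s)\big)^{\top} P_\phi(s)\,\big(a - \mu_\phi(s)\big) + V_\phi(s)$$

With $P_\phi(s)$ constrained to be positive-definite, the maximizing action is simply $\mu_\phi(s)$ and the maximum value is $V_\phi(s)$.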

DDPG: learn an approximate maximizer.

The idea is to train a second network $\mu_\theta(s)$ that takes in a state and outputs an action approximating the true Q-maximizing action, i.e. $\mu_\theta(s) \approx \text{argmax}_a\, Q_\phi(s,a)$. We can then train this network by ascending the gradient of the current Q-function with respect to $\theta$:
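A sketch of this actor update in PyTorch-style code; `actor`, `q_net`, and `actor_optimizer` are placeholder names, and the full DDPG algorithm additionally keeps target copies of both networks.

```python
def actor_update(actor, q_net, actor_optimizer, states):
    """Push the actor's output toward higher-Q actions:
    maximize Q(s, mu(s)) by gradient ascent on the actor parameters."""
    actions = actor(states)                      # mu_theta(s)
    actor_loss = -q_net(states, actions).mean()  # minimizing -Q == maximizing Q
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```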

All included screenshots are credited to Lecture 8 (slides).
