DQN and beyond
Q Learning that actually works :)
The main goal of this part is to understand how DQN and its variants solve the key challenges in fitted Q-iteration. Note how seemingly small changes to fitted-Q, such as replay buffers and target networks, play just as important a role as deep neural nets in making DQN work successfully.
Replay Buffer: Avoids Correlated Samples
If we look at the online Q-learning algorithm:
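The slide shows the exact per-step update; as a rough stand-in, here is a minimal Python/PyTorch sketch of one such update (network sizes, learning rate, and names like online_q_update are placeholder assumptions of mine, not values from the lecture):

```python
import torch
import torch.nn as nn

# Placeholder problem sizes (not from the lecture).
obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def online_q_update(s, a, r, s_next, done):
    """One online Q-learning step: nudge Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    with torch.no_grad():                         # the target is treated as a constant
        target = r + gamma * (1.0 - done) * q_net(s_next).max().item()
    q_sa = q_net(s)[a]                            # current estimate of Q(s, a)
    loss = (q_sa - target) ** 2                   # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# In online Q-learning this runs after *every* environment step, on the single
# most recent transition; this is exactly where the correlation problem comes from.
```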
Here the Q-function is updated after every executed action, which means the samples collected in this fashion are highly correlated, and the algorithm will be locally overfitting to the Q-values in its "neighborhood", as illustrated below:
so that towards the end of these longer trajectories, the Q-values for earlier states will have become inaccurate. Parallel workers are an option but only partially solve this problem: by adding synchronized or asynchronous workers, the Q-values get averaged out better across parallel environments at each timestep, but this still doesn't help with "remembering" the Q-estimates for states from earlier in each worker's trajectory. A replay buffer, on the other hand, is a simple idea that breaks this sample correlation: put all experienced transitions into one buffer and sample from it uniformly (or more smartly, with prioritization). Another advantage is that samples in the buffer don't have to be on-policy, since we are only evaluating state-action pairs.
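A minimal sketch of such a buffer (class and method names here are my own, not from any particular library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform replay buffer (an illustrative sketch, not DQN's exact implementation)."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)    # oldest transitions are evicted automatically

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation of consecutive transitions;
        # prioritized replay would instead sample proportionally to TD error.
        batch = random.sample(self.storage, batch_size)
        return list(zip(*batch))                 # tuples of (states, actions, rewards, next_states, dones)
```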
Target Network: Avoids Moving Target
If we compare the Q-function update step in Q-learning and in fitted Q-iteration:
In fitted Q-iteration (below), the gradient update for Q is a regression problem minimizing an l2 loss against fixed targets; this is stable for estimating a Q-value for every state-action pair under that policy. But in Q-learning (above), even though the samples are de-correlated (they come from the buffer), we are using the Q-function to update itself, i.e. a "moving target", which is unstable and might not converge.
Using a target network that only gets updated periodically helps tackle this problem. Note how the "classic" deep Q-learning algorithm (DQN) is obtained by setting particular choices within this general scheme:
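As a rough illustration (hyperparameters and names below are placeholder assumptions, not the paper's values), the update regresses toward a frozen copy of the network that is only synced occasionally; the arguments are assumed to be batched tensors sampled from the replay buffer:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder problem sizes (not from the lecture or the paper).
obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)        # lagged copy, used only for computing targets
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_next, done):
    """One gradient step on a replayed minibatch, regressing toward the frozen target network."""
    with torch.no_grad():                # the target does not move during this step
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Copy current weights into the target network, e.g. every N environment steps."""
    target_net.load_state_dict(q_net.state_dict())
```

Roughly speaking, the classic DQN special case corresponds to taking one such gradient step per environment step (or per few steps) and calling sync_target only every fixed, fairly large number of steps.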
Soon after the seminal DQN paper came out, many follow-up works that further improve the algorithm were introduced. Below are some of the most commonly used ones:
In implementation, double-Q is just a one-line change in the loss function inside the original DQN's gradient update step, sketched below. But to understand why it matters and why it consistently outperforms DQN, it helps to first understand why DQN tends to over-estimate Q-values:
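The slide's argument boils down to a standard fact about maxima of noisy estimates; paraphrasing it here (writing $Q_{\phi}$ for the current network and $Q_{\phi'}$ for the target network): for any two random variables,

$$\mathbb{E}\big[\max(X_1, X_2)\big] \;\ge\; \max\big(\mathbb{E}[X_1],\, \mathbb{E}[X_2]\big),$$

and the standard target uses

$$\max_{a'} Q_{\phi'}(s', a') \;=\; Q_{\phi'}\Big(s', \arg\max_{a'} Q_{\phi'}(s', a')\Big),$$

so when the Q-estimates are noisy, the max systematically picks up the positive noise and the target gets inflated.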
I find this simple, general fact about random variables pretty intuitive: because both the Q-value estimate and the argmax action come from the same (noisy) current Q-network, the target value easily gets over-estimated. But by using double-Q, we use the current Q to select the action and the target Q to get the state-action value, so that we use two Q's that are "noisy in different ways", which largely resolves the problem.
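Concretely, the one-line change is in how the target action is selected. A sketch, reusing the placeholder q_net / target_net from the sketch above:

```python
import torch

def dqn_target(r, s_next, done, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the action.
    with torch.no_grad():
        return r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

def double_dqn_target(r, s_next, done, gamma=0.99):
    # Double DQN: the *current* network selects the action and the *target*
    # network evaluates it, so the two estimates are "noisy in different ways".
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        return r + gamma * (1.0 - done) * target_net(s_next).gather(1, a_star).squeeze(1)
```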
[insert very straightforward formula from Rainbow DQN paper]
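Assuming the placeholder refers to the multi-step (N-step) return target used in Rainbow, the formula would be

$$y_t \;=\; \sum_{t'=t}^{t+N-1} \gamma^{\,t'-t}\, r_{t'} \;+\; \gamma^{N} \max_{a'} Q_{\phi'}\big(s_{t+N},\, a'\big),$$

i.e. sum N actual rewards before bootstrapping; this gives less-biased targets while the Q-estimates are still poor, at the cost of only being strictly correct for on-policy data.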
The classic Q-learning algorithms above always select exactly one action when maximizing the Q-function, but what if we have a continuous action space and can't do this discrete action selection? There are a few ways to handle this continuity in Q-learning:
Optimization: approximate the inner max over actions with a stochastic optimizer such as the cross-entropy method (CEM) or CMA-ES.
Normalized Advantage Functions (NAF): use a function class that is easy to optimize over actions, e.g. a Q-function that is quadratic in the action, so the maximizing action is available in closed form.
DDPG: learn an approximate maximizer.
The idea is to train a second network that takes in a state and outputs one action approximating the true Q-value-maximizing action, so we can train this network with the current Q-function:
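A minimal sketch of that idea (actor/critic names and sizes here are my own placeholders; it omits DDPG details such as exploration noise and soft target updates):

```python
import torch
import torch.nn as nn

# Placeholder dimensions for a continuous-control task (my assumptions, not DDPG's).
obs_dim, act_dim = 8, 2
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))        # Q(s, a)
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())  # mu(s), the approximate argmax
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def actor_update(s):
    """Train mu(s) to output actions the current critic scores highly,
    i.e. an approximate argmax_a Q(s, a). Only the actor is updated here."""
    a = actor(s)                                  # proposed action for each state in the batch
    q = critic(torch.cat([s, a], dim=-1))         # critic's estimate of Q(s, mu(s))
    loss = -q.mean()                              # maximizing Q is minimizing -Q
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
```

The Q-function's own regression targets then use this approximate maximizer, i.e. something like y = r + gamma * Q_target(s', mu_target(s')) with lagged copies of both networks.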
All included screenshots credit to Lecture 8 (slides)