Policy and Value Iteration Algorithms
This section on value-based methods is split into two parts. I will first lay out three classic algorithms: policy iteration, value iteration, and fitted Q-iteration, and then shift to state-of-the-art deep Q-learning. A main goal is to understand not only each algorithm on its own but also how these value-based methods relate to each other. It's always important to know exactly what you are doing before throwing a deep neural net at it :)
Previously, in policy gradient algorithms, we directly learned and updated a policy by evaluating it with a value or advantage function. Here in the world of value-based methods, we instead learn only a good value function for each (state, action) pair, so that for each state we encounter, we can induce a good policy by simply taking the action that maximizes the value function at that state.
If we put the policy iteration and value iteration algorithms side by side, the key difference is whether we update the policy immediately after we finish evaluating all the experienced states under the current policy.
Policy Iteration
For policy iteration algorithms, there are two alternating steps:
Policy Evaluation: for the current policy $\pi$, each state under this policy is mapped to a value defined recursively as $V^\pi(s) = \mathbb{E}_{a \sim \pi(a|s)}\big[r(s,a) + \gamma\,\mathbb{E}_{s' \sim p(s'|s,a)}[V^\pi(s')]\big]$. Note that these state values are specific to each policy, and the values of next states are taken in expectation because the state transition dynamics may be stochastic. Side note: the above equation is called a bootstrapped update, because when evaluating the values of next states we are still using the current estimate of the V-function.
Policy Update: once all states under the current policy have been given values, we simply switch to a new policy that takes the $\arg\max$ action of the advantage function at every state, $\pi'(s) = \arg\max_a A^\pi(s,a)$. Note that fitting the value function is equivalent to fitting the advantage function by definition: $A^\pi(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s' \sim p(s'|s,a)}[V^\pi(s')] - V^\pi(s)$.
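To make the two alternating steps concrete, here is a minimal tabular sketch (a toy illustration, assuming we know the transition tensor `P` and reward table `R`, which is exactly the tabular/DP setting discussed below):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, eval_sweeps=200):
    """Tabular policy iteration.

    P: transition probabilities, shape (S, A, S), P[s, a, s2] = p(s2 | s, a)
    R: rewards, shape (S, A)
    """
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)            # arbitrary initial deterministic policy
    while True:
        # --- Policy evaluation: repeat the bootstrapped backup under the current policy ---
        V = np.zeros(S)
        for _ in range(eval_sweeps):
            V = R[np.arange(S), policy] + gamma * P[np.arange(S), policy] @ V
        # --- Policy update: act greedily w.r.t. the advantage (same argmax as the Q-function) ---
        Q = R + gamma * P @ V                  # shape (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy): # policy stopped changing -> done
            return policy, V
        policy = new_policy
```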
However, since step two above is just a simple $\arg\max$ update rule, we can omit representing and re-selecting a policy explicitly and instead fold the greedy choice directly into the value backup, fitting the values through a Q-function; hence the name value iteration:

$$Q(s,a) \leftarrow r(s,a) + \gamma\,\mathbb{E}_{s' \sim p(s'|s,a)}[V(s')], \qquad V(s) \leftarrow \max_a Q(s,a)$$
Notice we can do this because the state value $V^\pi(s)$ doesn't depend on the action choice, so the $\arg\max$ of $A^\pi(s,a)$ and of $Q^\pi(s,a)$ is the same (they differ only by $V^\pi(s)$, which is constant across actions). So if we have Q-values, we can use them to get a policy that chooses the best action at every state.
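For comparison, the same toy setup with value iteration — no explicit policy is stored until the very end (again just a sketch under the known-dynamics assumption):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Tabular value iteration on the same (P, R) MDP as above."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V            # Q(s, a) <- r(s, a) + gamma * E[V(s')]
        V_new = Q.max(axis=1)            # V(s) <- max_a Q(s, a), no policy re-selection step
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new   # greedy policy recovered only at the end
        V = V_new
```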
I think an interesting question that came up during lecture is whether we lose anything by restricting ourselves to deterministic ($\arg\max$) policies. The short answer is: for fully-observed MDPs, there always exists an optimal deterministic policy (though there may also be a stochastic one that is equally optimal). However, for partially-observed MDPs, such as games with hidden information, a stochastic policy can outperform any deterministic one.
A big assumption for the above policy and value iteration algorithms to work is that all the possible state-action pairs can be put in one reasonably-sized table like this:
and we can use dynamic programming to solve for an optimal policy. However, most problems we are trying to solve in RL (e.g. driving a car or playing a challenging game) have much larger state and action spaces. Simple DP won't be efficient enough in this case, hence the use of fitted value iteration, whose key distinction from the tabular algorithms is the option to use a neural network to parametrize the value function $V$ and gradient updates to improve its estimates.
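A rough sketch of one fitted V-iteration step, assuming a hypothetical known dynamics function `model(s, a) -> (reward, next_state)` and a small discrete action set (both are illustrative assumptions; needing that model is exactly the problem raised next):

```python
import torch
import torch.nn as nn

def fitted_v_iteration_step(V_net, optimizer, states, actions, model, gamma=0.99):
    """Regress V_net(s) onto the target max_a [ r(s, a) + gamma * V_net(s') ].

    states:  tensor of sampled states, shape (N, state_dim)
    actions: list of discrete actions
    model:   hypothetical known dynamics, model(s, a) -> (reward, next_state)
    """
    with torch.no_grad():
        targets = []
        for s in states:
            # must enumerate every action's next state -> requires the dynamics model
            backups = [r + gamma * V_net(s_next).squeeze()
                       for r, s_next in (model(s, a) for a in actions)]
            targets.append(torch.stack(backups).max())
        targets = torch.stack(targets)
    loss = nn.functional.mse_loss(V_net(states).squeeze(-1), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```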
Notice how, if we use the fitted value update above, we still need to iterate over all possible next states induced by all possible actions before we can estimate the current state's value. This means we need to visit states outside of the sample trajectories we have, which can be expensive; plus, sometimes we might not know the transition dynamics at all. To avoid this, notice that a Q-function narrows the estimate down to a specific state-action pair, so the max over next actions can be taken with the Q-function itself at the sampled next state; hence fitted Q-iteration:

$$y_i \leftarrow r(s_i,a_i) + \gamma \max_{a'} Q_\phi(s_i', a'), \qquad \phi \leftarrow \arg\min_\phi \tfrac{1}{2}\sum_i \big\| Q_\phi(s_i,a_i) - y_i \big\|^2$$
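A minimal sketch of one such fitted Q-iteration step on a batch of off-policy transitions; here `Q_net` is assumed to map a state batch of shape (N, state_dim) to Q-values over a discrete action set, shape (N, num_actions):

```python
import torch
import torch.nn as nn

def fitted_q_iteration_step(Q_net, optimizer, batch, gamma=0.99):
    """One regression step of fitted Q-iteration; no dynamics model needed."""
    s, a, r, s_next, done = batch                   # a: (N,) long, done: (N,) float in {0, 1}
    with torch.no_grad():
        # bootstrapped target y = r + gamma * max_a' Q(s', a'), using Q_net itself for the max
        target = r + gamma * (1.0 - done) * Q_net(s_next).max(dim=1).values
    q_sa = Q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for the actions actually taken
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```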
Unfortunately, fitted-Q loses its convergence guarantees once we use function approximation over large state/action spaces, but it's still important to understand how we got here so we can discuss why the deep-Q algorithm below works.
A few details to better grasp the fitted-Q algorithm:
It's off-policy: the sampled state-action transitions don't need to be generated under the current policy for the Q-function to give them an estimate (hence the step-2 update).
In the online version of fitted Q-iteration, we can use different strategies in step 1 to give the algorithm sufficient exploration of the often-huge state-action space. For now, two such strategies to know about are “epsilon-greedy” and “Boltzmann exploration”.
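Quick sketches of both strategies, given the current state's vector of Q-values:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a uniformly random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                           # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))
```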
Skipped for now, but generally:
All included screenshots credit to Lecture 7 (slides)
We can make it fully online by updating the Q-function (and hence the implied greedy policy) after every single transition, before taking a new action and gathering more samples:
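A tabular sketch of one such online step (the `env.step` interface here is an assumed Gym-like one returning `(next_state, reward, done)`):

```python
import numpy as np

def online_q_step(Q_table, s, env, epsilon=0.1, alpha=0.1, gamma=0.99):
    """Take one action, observe one transition, and update Q immediately.

    Q_table: array of shape (num_states, num_actions); env.step(a) is an assumed
    Gym-like interface returning (next_state, reward, done).
    """
    # 1. act under the current (epsilon-greedy) policy
    if np.random.rand() < epsilon:
        a = np.random.randint(Q_table.shape[1])
    else:
        a = int(np.argmax(Q_table[s]))
    # 2. gather exactly one transition
    s_next, r, done = env.step(a)
    # 3. single-sample bootstrapped update before acting again
    target = r + (0.0 if done else gamma * np.max(Q_table[s_next]))
    Q_table[s, a] += alpha * (target - Q_table[s, a])
    return s_next, done
```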