Advanced Policy Gradients

A Policy Iteration View for PG

We learned about the idea of policy iteration when discussing Actor Critic Algorithms. Here, let's compare and contrast policy gradient (PG) and policy iteration (PI). Recall that both have a policy-improvement step, but they differ in how the policy gets updated: PG takes the gradient of the objective with respect to the policy parameters and then re-parametrizes the policy, while PI doesn't need parameters at all: it induces a new policy by simply taking the $\arg\max$ action with respect to the advantage values. Notice how the gradient of the objective can be written as the $\nabla_\theta \log \pi_\theta$-weighted advantage under the current policy.
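For reference, a rough side-by-side of the two update rules (my own transcription, so the notation may differ slightly from the slides):

```latex
% Policy gradient: advantage-weighted score function, under the current policy
\[
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t)
    \right]
\]

% Policy iteration: greedy improvement step, no parameters needed
\[
\pi'(s) = \arg\max_{a} A^{\pi}(s, a)
\]
```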

Using PG, we can define a new quantity called the policy improvement: the difference between the objective values under the current policy and the next, gradient-updated policy. I'm omitting the derivations below, but you can read them in the lecture slides, which show that this improvement equals the expected advantage of the old policy, with the expectation taken over the new policy. My own way to interpret this is that, if you move the old objective $J(\theta)$ to the right-hand side of the identity below, you can see that the new objective comes from a new policy that adds to the old objective value by re-adjusting the old policy's action choices based on its advantage values; these marginal gains are added onto the old objective in a piece-wise fashion. So, looking at it from a value-based perspective, the new gradient-updated policy can also be seen as induced by an $\arg\max$ procedure.
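Reconstructing the identity here (the slides walk through the full derivation):

```latex
\[
J(\theta') - J(\theta)
  = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\!\left[
      \sum_t \gamma^t A^{\pi_\theta}(s_t, a_t)
    \right]
\]
```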

Hence, we've established the equivalence between policy iteration and policy gradient, but there's one issue that prevents us from using this derived expectation to optimize our policy: the advantage is measured under the old policy $\pi_\theta$, but the expectation is taken over the new policy $\pi_{\theta'}$. Because the new policy is unknown, we can't sample batches under it to estimate the expectation above. However, the importance sampling trick lets us still use samples from the old/current policy, adding a distribution-ratio term to account for the mismatch between the old and the new policy:
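In symbols (my transcription; the slide may group terms slightly differently):

```latex
\[
\mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\!\left[\sum_t \gamma^t A^{\pi_\theta}(s_t, a_t)\right]
  = \sum_t \mathbb{E}_{s_t \sim p_{\theta'}(s_t)}\!\left[
      \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\!\left[
        \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\,
        \gamma^t A^{\pi_\theta}(s_t, a_t)
      \right]
    \right]
\]
```

The actions are now sampled from the old policy, but the states are still drawn from $p_{\theta'}(s_t)$, which is exactly the mismatch discussed next.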

Note that there's a difference between the state-marginal probability ratio (the $p_{\theta'}(s)$'s) and the policy distribution ratio (the $\pi_{\theta'}(a \mid s)$'s). It might feel intuitive, but it actually takes a rigorous proof to show that the marginal probabilities of a state under two different policies are close whenever the two policy distributions are close. You can check out the famous (or infamous?) TRPO paper, which gives a proof bounding the state-marginal mismatch by $\delta$, the bound originally placed on the policy distribution mismatch:
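Roughly, the statement is (my paraphrase):

```latex
\[
\text{if } \left|\pi_{\theta'}(a_t \mid s_t) - \pi_\theta(a_t \mid s_t)\right| \le \delta
  \text{ for all } s_t,
\quad \text{then} \quad
\left|p_{\theta'}(s_t) - p_\theta(s_t)\right| \le 2\,\delta\, t .
\]
```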

The neat thing about this bound is that it applies both to a deterministic policy and to a general policy distribution; only the notion of "closeness" between policies is slightly different:
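As I read it, the two notions are (assuming $\delta$ plays the same role in both cases):

```latex
% deterministic policy: the new policy picks a different action only rarely
\[
\Pr\!\left[\pi_{\theta'}(s_t) \neq \pi_\theta(s_t)\right] \le \delta
\]

% general stochastic policy: small total variation distance between action distributions
\[
\tfrac{1}{2} \sum_{a_t} \left|\pi_{\theta'}(a_t \mid s_t) - \pi_\theta(a_t \mid s_t)\right| \le \delta
\]
```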

Now that we have a bound on the state marginal probabilities, we can expand and lower-bound the objective value (more math omitted here), and thus increase the objective by maximizing its lower bound:
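The resulting bound looks like this (my transcription, with $f(s_t)$ standing for the inner importance-weighted expectation over actions):

```latex
\[
\mathbb{E}_{s_t \sim p_{\theta'}(s_t)}\!\left[f(s_t)\right]
  \;\ge\; \mathbb{E}_{s_t \sim p_\theta(s_t)}\!\left[f(s_t)\right]
          \;-\; 2\,\delta\, t\, \max_{s_t} f(s_t)
\]
```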

At this point, we can write policy gradient as a constrained optimization problem: we are essentially finding a new policy that is close enough to the current/old one that the distribution mismatch can be ignored, but that also increases the objective value:
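Roughly (writing $\bar{A}(\theta')$ for the importance-sampled surrogate objective, the same quantity referred to further below):

```latex
\[
\bar{A}(\theta')
  = \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}\!\left[
      \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\!\left[
        \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\,
        \gamma^t A^{\pi_\theta}(s_t, a_t)
      \right]
    \right]
\]

\[
\theta' \leftarrow \arg\max_{\theta'} \bar{A}(\theta')
\quad \text{s.t.} \quad
\left|\pi_{\theta'}(a_t \mid s_t) - \pi_\theta(a_t \mid s_t)\right| \le \delta
\]
```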

Above, we bounded the distribution mismatch by a constant $\delta$, but we can also bound the state marginals more conveniently using the KL divergence between the policies, a quantity that is easier to approximate, and solve the optimization problem with this different constraint instead:
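That is (my transcription; since total variation is bounded by $\sqrt{\tfrac{1}{2} D_{\mathrm{KL}}}$, a small KL divergence also keeps the state marginals close):

```latex
\[
\theta' \leftarrow \arg\max_{\theta'} \bar{A}(\theta')
\quad \text{s.t.} \quad
\mathbb{E}_{s_t \sim p_\theta(s_t)}\!\left[
  D_{\mathrm{KL}}\!\left(\pi_{\theta'}(\cdot \mid s_t) \,\middle\|\, \pi_\theta(\cdot \mid s_t)\right)
\right] \le \delta
\]
```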

How to optimize the objective?

1. Dual Gradient Descent

As shown in the new objective above, dual gradient descent incorporates the constraint into the objective with a weight $\lambda$, and alternately updates $\lambda$ to make sure the constraint is not violated too much.
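A sketch of the alternating updates, as I understand them from the slides:

```latex
% Lagrangian: fold the KL constraint into the objective with multiplier lambda
\[
\mathcal{L}(\theta', \lambda)
  = \bar{A}(\theta')
    - \lambda \left( D_{\mathrm{KL}}\!\left(\pi_{\theta'} \,\middle\|\, \pi_\theta\right) - \delta \right)
\]

% 1. maximize L w.r.t. theta' (one or a few gradient steps)
% 2. raise lambda when the constraint is violated, lower it otherwise
\[
\lambda \leftarrow \lambda
  + \alpha \left( D_{\mathrm{KL}}\!\left(\pi_{\theta'} \,\middle\|\, \pi_\theta\right) - \delta \right)
\]
```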

2. Taylor expansion (on the objective)

Using a first-order approximation, i.e. linearization, via Taylor expansion allows us to optimize a locally linear objective instead:
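In symbols, the linearized problem is roughly:

```latex
\[
\theta' \leftarrow \arg\max_{\theta'}\;
  \nabla_{\theta'} \bar{A}(\theta')\Big|_{\theta' = \theta}^{\!\top} (\theta' - \theta)
\quad \text{s.t.} \quad
D_{\mathrm{KL}}\!\left(\pi_{\theta'} \,\middle\|\, \pi_\theta\right) \le \delta
\]
```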

The nice thing about evaluating the gradient of the above $\bar{A}(\theta')$ at the current parameter $\theta$ is that the importance weight disappears, so we recover the standard policy gradient:
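Concretely, the gradient of the importance ratio evaluated at $\theta' = \theta$ turns into $\nabla_\theta \log \pi_\theta$, giving:

```latex
\[
\nabla_{\theta'} \bar{A}(\theta')\Big|_{\theta' = \theta}
  = \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}\!\left[
      \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\!\left[
        \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t)
      \right]
    \right]
  = \nabla_\theta J(\theta)
\]
```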

3. Natural Policy Gradient

Taylor expansion (on the constraint): here, by expanding the KL constraint to second order and introducing $\mathbf{F}$, the Fisher information matrix, we can directly update the policy parameters:
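The resulting update is (a sketch; the step size $\alpha$ comes from solving the quadratically-approximated constraint exactly):

```latex
\[
\theta' = \theta + \alpha\, \mathbf{F}^{-1} \nabla_\theta J(\theta),
\qquad
\mathbf{F} = \mathbb{E}_{\pi_\theta}\!\left[
  \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top}
\right],
\qquad
\alpha = \sqrt{\frac{2\,\delta}{\nabla_\theta J(\theta)^{\top}\, \mathbf{F}^{-1}\, \nabla_\theta J(\theta)}}
\]
```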

4. Trust Region Policy Optimization (TRPO)

Avoids natural policy gradient's explicit Fisher-matrix computation and inversion by using more advanced numerical methods (conjugate gradient, in particular), making it better suited to deep learning.

5. Proximal Policy Optimization (PPO)

Uses the importance-sampled objective directly, adding regularization to the gradient updates (a KL penalty, or clipping of the importance ratio) to stay close to the old policy.
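A minimal PyTorch-style sketch of the clipped variant of the PPO surrogate (the clipping plays the role of the regularization mentioned above; the function and argument names are mine, not from the slides):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate, negated so it can be minimized with an optimizer.

    log_probs_new: log pi_theta'(a|s) from the policy being updated
    log_probs_old: log pi_theta(a|s) from the data-collecting (old) policy, detached
    advantages:    advantage estimates A^{pi_theta}(s, a) under the old policy
    """
    # Importance ratio pi_theta'(a|s) / pi_theta(a|s)
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Keep the ratio inside [1 - eps, 1 + eps] so the update stays near the old policy
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic (elementwise minimum) surrogate, averaged over the batch
    return -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
```

Taking the elementwise minimum makes the objective pessimistic: the policy gets no extra credit for pushing the ratio beyond the clip range, which is what keeps it close to the old policy.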

All included screenshots are credited to the Lecture 9 slides.
