Policy Gradient Basics

This is the first part of a three-part series on policy gradient methods.

This part is pretty math-heavy; it took me quite some time and multiple attempts of staring at the derivations to understand everything. Below I'll try to explain the math step by step in my own words, hopefully making it easier:

First, let's revisit the goal of RL, written as an expected total reward that we are trying to maximize (the second equation below):

Using an MDP model, we can assign a probability to any trajectory under a policy (parametrized by \theta), as shown in the first equation above. Then, since each state-action pair has a corresponding scalar reward r (defined by us or induced from the environment), we can sum across all time steps of a trajectory to get a cumulative reward. Finally, for each policy we can compute an expected total reward by taking the expectation of this cumulative reward over all possible trajectories, weighted by their probabilities; the RL objective is to find a policy that maximizes this expected reward.
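In standard notation (a sketch of the two equations referenced above, assuming the usual MDP formulation with trajectory \tau = (s_1, a_1, \dots, s_T, a_T)):

p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)

\theta^{\star} = \arg\max_\theta J(\theta), \qquad J(\theta) = E_{\tau \sim p_\theta(\tau)} \Big[ \sum_{t=1}^{T} r(s_t, a_t) \Big]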

Direct policy differentiation

  • So now we only need to take the gradient of the action probabilities conditioned on each state, instead of the gradient of the whole trajectory probability, which would involve the unknown environment dynamics; see the sketch below.
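A sketch of where this comes from, in the notation above: using the identity \nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) (the log-derivative trick), and noting that the initial-state and transition terms of \log p_\theta(\tau) do not depend on \theta (so their gradients vanish), we get

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)} \big[ \nabla_\theta \log p_\theta(\tau) \, r(\tau) \big] = E_{\tau \sim p_\theta(\tau)} \Big[ \Big( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big) \Big( \sum_{t=1}^{T} r(s_t, a_t) \Big) \Big]

where r(\tau) = \sum_{t} r(s_t, a_t) is the total reward of a trajectory.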

The REINFORCE Algorithm

At this point, we are done with the math-y part, and ready to look at a simple policy gradient algorithm called REINFORCE:

At a high level, during each policy update step we are essentially re-weighting each action's conditional probability, making the "good stuff", i.e. high-reward state-action snippets, more likely than the "bad stuff". However, this direct differentiation suffers from high variance, as illustrated below:

Here the x-axis represents all possible trajectories, the vertical lines denote sampled trajectories' total rewards, and the curves represent the policy's probability distribution over these trajectories. We can see at first glance that some trajectories are relatively better than others, but since the update weights each sample by its raw reward value, samples with large-magnitude (e.g. very negative) rewards can drag the distribution curve around drastically, so the updated policy depends heavily on which trajectories happened to be sampled.
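Before moving on to variance reduction, here is a minimal, self-contained sketch of one REINFORCE update step for a tabular softmax policy. Everything here (the sizes, the random stand-in rollouts, names like `pi` and `grad_log_pi`) is made up for illustration and not taken from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3               # toy sizes
theta = np.zeros((n_states, n_actions))  # tabular policy parameters (logits)

def pi(theta, s):
    """Softmax action probabilities for state s."""
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi(a|s) w.r.t. theta (same shape as theta)."""
    g = np.zeros_like(theta)
    g[s] = -pi(theta, s)
    g[s, a] += 1.0
    return g

# Step 1: sample N trajectories under the current policy.
# (States and rewards are random placeholders here, standing in for a real environment.)
N, T = 16, 10
trajectories = []
for _ in range(N):
    traj = []
    for _ in range(T):
        s = rng.integers(n_states)
        a = rng.choice(n_actions, p=pi(theta, s))
        r = rng.normal()                  # stand-in for the environment's reward
        traj.append((s, a, r))
    trajectories.append(traj)

# Step 2: gradient estimate: (1/N) sum_i [ (sum_t grad log pi(a_t|s_t)) * (sum_t r_t) ]
grad = np.zeros_like(theta)
for traj in trajectories:
    total_reward = sum(r for _, _, r in traj)
    glog = sum(grad_log_pi(theta, s, a) for s, a, _ in traj)
    grad += glog * total_reward
grad /= N

# Step 3: gradient ascent step on the policy parameters.
alpha = 0.01
theta += alpha * grad
```

Step 1 would normally come from rolling out the current policy in the real environment; the random states and rewards above are just placeholders so the snippet runs on its own.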

Variance Reduction Methods

Reward-to-Go
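In brief: an action taken at time t cannot affect rewards received before t, so each \nabla_\theta \log \pi_\theta(a_t \mid s_t) term only needs to be weighted by the reward from t onward (the "reward to go"), which reduces variance without biasing the estimator:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{Q}_{i,t}, \qquad \hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})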

Baselines: optimal and in practice

Since the absolute values of trajectory rewards play a big role and often cause high variance, intuitively we want to reward not just "good" trajectories, but trajectories that are better than average. Hence we subtract an average value from our reward term; this value is called a baseline. Note that we can use either:

1) the average reward: a "pretty good" baseline that's easy to implement

2) the optimal baseline, derived by analyzing (minimizing) the variance of the gradient estimator (see the equations below)
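Written out with sampled trajectories (a sketch in the notation above):

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i) \, \big( r(\tau_i) - b \big), \qquad b = \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)

Subtracting a constant b does not bias the gradient, because E_{\tau \sim p_\theta}[\nabla_\theta \log p_\theta(\tau) \, b] = b \, \nabla_\theta \int p_\theta(\tau) \, d\tau = b \, \nabla_\theta 1 = 0. The variance-minimizing (optimal) baseline turns out to be the reward weighted by squared gradient magnitude, computed per parameter dimension:

b^{\star} = \frac{E\big[ g(\tau)^2 \, r(\tau) \big]}{E\big[ g(\tau)^2 \big]}, \qquad g(\tau) = \nabla_\theta \log p_\theta(\tau)

but in practice the simple average reward above usually works well enough.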

Off-policy PG & Importance Sampling

Another issue with the REINFORCE policy gradient is that it needs the current policy itself to gather a batch of samples in step 1 before computing the gradient and updating it, which makes it an on-policy algorithm. But this sample-gathering process is often expensive, and stepping in the environment under a bad policy can be costly.

One way to avoid this is to use importance sampling to obtain samples under a different policy, so that we can reuse some previously collected samples to update the current policy.
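The identity behind this is importance sampling; with \bar{\theta} denoting the (old) policy that generated the samples, and noting that the dynamics terms cancel in the ratio:

E_{\tau \sim p_\theta(\tau)} \big[ r(\tau) \big] = E_{\tau \sim p_{\bar{\theta}}(\tau)} \Big[ \frac{p_\theta(\tau)}{p_{\bar{\theta}}(\tau)} \, r(\tau) \Big], \qquad \frac{p_\theta(\tau)}{p_{\bar{\theta}}(\tau)} = \prod_{t=1}^{T} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\bar{\theta}}(a_t \mid s_t)}

The catch is that this product of ratios can easily blow up or vanish as T grows, which motivates the more advanced methods mentioned below.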

Where to go from here

The main goal of the sections above is to understand the math behind the basic policy gradient objective. From here, there are roughly two branches, each improving on a different part of the basic REINFORCE algorithm:

Actor-Critic Algorithms aim to lower the variance of policy gradients by fitting a value function, instead of using the sampled reward-to-go and baseline, and then using it to evaluate the advantage of each action. Hence the Actor is the policy and the Critic is the fitted value function V.

Advanced Policy Gradients follow up on the importance sampling method above, and generally use the old policy's state distribution to optimize a bound on the true objective.

I'll stop here and write more about the two branches in the next two notes:

All included screenshots are from lecture 5 (slides).
