Policy Gradient Basics

This is the first part of a three-part series on policy gradient methods.

This part is pretty math-heavy; it took me quite some time and multiple attempts of staring at the derivations to understand everything. Below I'll try to explain the math step by step in my own words, hopefully making it easier:

First, let's revisit the goal of RL, written as an expected total reward that we are trying to maximize (the second equation below):

Using an MDP model, we can assign a probability to any trajectory under a policy (parametrized by \theta), as shown in the first equation above. Then, since each state-action pair has a corresponding scalar reward r (defined by us or induced from the environment), we can sum across all time steps of a trajectory to get a cumulative reward. Finally, for each policy we can compute an expected total reward by taking the expectation of this cumulative reward over all possible trajectories, weighted by their probabilities; the RL objective is to find a policy that maximizes this expected reward.
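In standard notation (a sketch of the two equations referenced above, assuming the usual MDP formulation with trajectory \tau = (s_1, a_1, \dots, s_T, a_T)):

p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)

\theta^{\star} = \arg\max_\theta J(\theta), \qquad J(\theta) = E_{\tau \sim p_\theta(\tau)} \Big[ \sum_{t=1}^{T} r(s_t, a_t) \Big]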

Direct policy differentiation

  • So now we only need to take the gradient of the action probabilities conditioned on each state, instead of the gradient of the whole trajectory probability, which would involve the unknown environment dynamics; see the sketch below.
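A sketch of where this comes from, in the notation above: using the identity \nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) (the log-derivative trick), and noting that the initial-state and transition terms of \log p_\theta(\tau) do not depend on \theta (so their gradients vanish), we get

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)} \big[ \nabla_\theta \log p_\theta(\tau) \, r(\tau) \big] = E_{\tau \sim p_\theta(\tau)} \Big[ \Big( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big) \Big( \sum_{t=1}^{T} r(s_t, a_t) \Big) \Big]

where r(\tau) = \sum_{t} r(s_t, a_t) is the total reward of a trajectory.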

The REINFORCE Algorithm

At this point, we are done with the math-y part, and ready to look at a simple policy gradient algorithm called REINFORCE:

At a high level, during each policy update step we are essentially re-weighting each action's conditional probability, making the "good stuff", i.e. high-reward state-action snippets, more likely than the "bad stuff". However, this direct differentiation suffers from high variance, as illustrated below:

Here the x-axis represents all possible trajectories, the vertical lines denote sampled trajectories' total rewards, and the curves represent the policy's probability distribution over these trajectories. We can see at first glance that some trajectories are relatively better than others, but since the update weights each sample by its raw reward value, samples with large-magnitude (e.g. very negative) rewards can drag the distribution curve around drastically, so the updated policy depends heavily on which trajectories happened to be sampled.
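Before moving on to variance reduction, here is a minimal, self-contained sketch of one REINFORCE update step for a tabular softmax policy. Everything here (the sizes, the random stand-in rollouts, names like `pi` and `grad_log_pi`) is made up for illustration and not taken from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3               # toy sizes
theta = np.zeros((n_states, n_actions))  # tabular policy parameters (logits)

def pi(theta, s):
    """Softmax action probabilities for state s."""
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi(a|s) w.r.t. theta (same shape as theta)."""
    g = np.zeros_like(theta)
    g[s] = -pi(theta, s)
    g[s, a] += 1.0
    return g

# Step 1: sample N trajectories under the current policy.
# (States and rewards are random placeholders here, standing in for a real environment.)
N, T = 16, 10
trajectories = []
for _ in range(N):
    traj = []
    for _ in range(T):
        s = rng.integers(n_states)
        a = rng.choice(n_actions, p=pi(theta, s))
        r = rng.normal()                  # stand-in for the environment's reward
        traj.append((s, a, r))
    trajectories.append(traj)

# Step 2: gradient estimate: (1/N) sum_i [ (sum_t grad log pi(a_t|s_t)) * (sum_t r_t) ]
grad = np.zeros_like(theta)
for traj in trajectories:
    total_reward = sum(r for _, _, r in traj)
    glog = sum(grad_log_pi(theta, s, a) for s, a, _ in traj)
    grad += glog * total_reward
grad /= N

# Step 3: gradient ascent step on the policy parameters.
alpha = 0.01
theta += alpha * grad
```

Step 1 would normally come from rolling out the current policy in the real environment; the random states and rewards above are just placeholders so the snippet runs on its own.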

Variance Reduction Methods

Reward-to-Go
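In brief: an action taken at time t cannot affect rewards received before t, so each \nabla_\theta \log \pi_\theta(a_t \mid s_t) term only needs to be weighted by the reward from t onward (the "reward to go"), which reduces variance without biasing the estimator:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{Q}_{i,t}, \qquad \hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})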

Baselines: optimal and in practice

Since the absolute values of trajectory rewards play a big role and often cause high variance, intuitively we want to reward not just "good" trajectories, but trajectories that are better than average. Hence we subtract an average value from our reward term; this value is called a baseline. Note that we can use either:

1) the average reward: a "pretty good" baseline that's easy to implement

2) the optimal baseline, derived by analyzing (minimizing) the variance of the gradient estimator (see the equations below)
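Written out with sampled trajectories (a sketch in the notation above):

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i) \, \big( r(\tau_i) - b \big), \qquad b = \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)

Subtracting a constant b does not bias the gradient, because E_{\tau \sim p_\theta}[\nabla_\theta \log p_\theta(\tau) \, b] = b \, \nabla_\theta \int p_\theta(\tau) \, d\tau = b \, \nabla_\theta 1 = 0. The variance-minimizing (optimal) baseline turns out to be the reward weighted by squared gradient magnitude, computed per parameter dimension:

b^{\star} = \frac{E\big[ g(\tau)^2 \, r(\tau) \big]}{E\big[ g(\tau)^2 \big]}, \qquad g(\tau) = \nabla_\theta \log p_\theta(\tau)

but in practice the simple average reward above usually works well enough.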

Off-policy PG & Importance Sampling

Another issue with the REINFORCE policy gradient is that it needs the current policy itself to gather a batch of samples in step 1 before computing the gradient and updating it, which makes it an on-policy algorithm. But this sample-gathering process is often expensive, and stepping in the environment under a bad policy can be costly.

One way to avoid this is to use importance sampling to obtain samples under a different policy, so that we can reuse some previously collected samples to update the current policy.
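The identity behind this is importance sampling; with \bar{\theta} denoting the (old) policy that generated the samples, and noting that the dynamics terms cancel in the ratio:

E_{\tau \sim p_\theta(\tau)} \big[ r(\tau) \big] = E_{\tau \sim p_{\bar{\theta}}(\tau)} \Big[ \frac{p_\theta(\tau)}{p_{\bar{\theta}}(\tau)} \, r(\tau) \Big], \qquad \frac{p_\theta(\tau)}{p_{\bar{\theta}}(\tau)} = \prod_{t=1}^{T} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\bar{\theta}}(a_t \mid s_t)}

The catch is that this product of ratios can easily blow up or vanish as T grows, which motivates the more advanced methods mentioned below.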

Where to go from here

The main goal of the sections above is to understand the math behind the basic policy gradient objective. From here, there are roughly two branches, each improving on a different part of the basic REINFORCE algorithm:

Actor-Critic Algorithms aim to lower the variance of policy gradients by fitting a value function, instead of using the sampled reward-to-go and baseline, and then using it to evaluate the advantage of each action. Hence the Actor is the policy and the Critic is the fitted value function V.

Advanced Policy Gradients follow up on the importance sampling method above, and generally use the old policy's state distribution to optimize a bound on the true objective.

I'll stop here and write more about the two branches in the next two notes:

All included screenshots are from lecture 5 (slides).
