Actor-Critic Algorithms

Continuing from basic policy gradients and the variance-reduction methods, recall the "reward-to-go" term in our RL objective, which estimates the total reward collected from each time step onward in a trajectory. Aside from the variance issues mentioned before, this term is also problematic when the transitions or the policy are stochastic: even under the same policy, trajectories starting from the same initial state can end up with very different total rewards, so summing the rewards of a few sampled trajectories is a noisy estimate of the expected reward we actually need for good policy evaluation. Actor-critic is a line of algorithms that addresses this by replacing the naive summation of sampled rewards with a learned value estimator (the critic), which learns to produce a better estimate of the value of the states visited in the sampled trajectories.

With that in mind, let's review the closely connected definitions of the value functions Q and V and the advantage A, and note how our RL objective can be rewritten in terms of any one of them: each gives an evaluation of the trajectories induced by the policy, so fitting the best possible value function directly improves our policy gradients.

Note two things: 1) both Q and A can be rewritten and approximated in terms of V alone; and 2) by definition, V evaluated at a state is the expected cumulative reward from that state onward in time. Therefore we can fit just the V function, and the objective J can be written as the expectation of V(s) evaluated at the first state of each trajectory, as summarized below.
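For reference, the definitions and the approximations in play are:

$$
\begin{aligned}
Q^\pi(s_t, a_t) &= \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right] && \text{(total reward from taking } a_t \text{ in } s_t\text{)} \\
V^\pi(s_t) &= \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[ Q^\pi(s_t, a_t) \right] && \text{(total reward from } s_t\text{)} \\
A^\pi(s_t, a_t) &= Q^\pi(s_t, a_t) - V^\pi(s_t) && \text{(how much better } a_t \text{ is than the average action)} \\
Q^\pi(s_t, a_t) &= r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\left[ V^\pi(s_{t+1}) \right] \approx r(s_t, a_t) + V^\pi(s_{t+1}) \\
A^\pi(s_t, a_t) &\approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t) \\
J(\theta) &= \mathbb{E}_{s_1 \sim p(s_1)}\left[ V^\pi(s_1) \right]
\end{aligned}
$$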

P.S. Classic actor-critic algorithms fit V, which is intuitively easier since it only depends on the state, whereas Q depends on state-action pairs; but it is still possible to fit Q instead:
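If we do choose to fit a Q-network $Q_\phi^\pi$ directly, the bootstrapped regression target is analogous to the one for V discussed below, just bootstrapping from the next state-action pair sampled from the current policy (a standard construction, written here for reference rather than copied from the slides):

$$
y_{i,t} \approx r(s_{i,t}, a_{i,t}) + Q_\phi^\pi(s_{i,t+1}, a_{i,t+1}), \qquad a_{i,t+1} \sim \pi_\theta(\cdot \mid s_{i,t+1})
$$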

Again, we have trajectory samples available to us, so we can turn value fitting into a supervised regression problem: train a neural network as the function approximator for V and fit it to target values estimated from our samples. There are, however, different options for how to label the regression targets y before training the V-network on them:

  • Monte Carlo evaluation simply labels y with the sum of rewards observed from that state to the end of the trajectory. Note that this is sub-optimal: for each state we are using a single trajectory's rewards to stand in for all possible trajectories starting from that same state, whereas ideally the label would be the expectation over all of them (the first equation in the picture below):

  • Bootstrapped estimate mitigates the "one trajectory stands in for all" issue above: it labels y with the observed one-step reward plus the previous value function's estimate of the next state's value, which is less affected by the lack of samples. Notice that the only difference between the two methods is how we define y in the regression data.

  • Discount factors: one big issue with the above targets is that the value estimates can grow unboundedly if the task has an infinite horizon, so we avoid this by applying a discount factor \gamma \in (0, 1] that prioritizes rewards received sooner over those received further in the future, giving the target y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \gamma V_\phi^\pi(s_{i,t+1}). A minimal sketch of fitting the critic with these targets follows this list.
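As a rough illustration of the regression step (the names ValueNet, monte_carlo_targets, bootstrapped_targets, and fit_critic are illustrative placeholders, not anything from the lecture), a minimal PyTorch sketch of the two labeling options and the supervised fit might look like:

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Simple MLP critic: V_phi(s) -> scalar value estimate."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def monte_carlo_targets(rewards, gamma=0.99):
    """Label y_t with the discounted sum of rewards of one sampled trajectory."""
    targets, running = [], 0.0
    for r in reversed(rewards):            # rewards: list of floats for one trajectory
        running = r + gamma * running
        targets.append(running)
    return torch.tensor(list(reversed(targets)), dtype=torch.float32)

def bootstrapped_targets(critic, rewards, next_obs, dones, gamma=0.99):
    """Label y_t = r_t + gamma * V_phi(s_{t+1}), cutting the bootstrap at terminals.
    rewards/dones are float tensors, next_obs is a batch of next states."""
    with torch.no_grad():
        next_values = critic(next_obs)
    return rewards + gamma * (1.0 - dones) * next_values

def fit_critic(critic, optimizer, obs, targets):
    """One supervised regression step: minimize ||V_phi(s) - y||^2."""
    loss = ((critic(obs) - targets) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the bootstrapped targets are recomputed with the current critic each time fresh on-policy samples come in, so the regression data keeps tracking the current policy.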

More Options at the Implementation Level

Architecture Design

Under this actor-critic framework, we have a choice in how to construct deep networks:

  1. Make two separate networks: one approximating V, hence called the "critic", and the other for the policy (selecting an action given the state as input), i.e. the "actor". This is simple and stable, but nothing is shared between the two.
  2. Note that both V and \pi take the state s as input, so when dealing with high-dimensional state spaces (e.g. images) we may want to share the low-level representations between actor and critic, i.e. use a single shared trunk with two output heads, one for the policy and one for the value (a sketch of this design follows).
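A minimal sketch of the shared-trunk design, assuming a discrete action space; the class name SharedActorCritic and the layer sizes are illustrative choices, not from the lecture:

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Shared low-level trunk with separate policy (actor) and value (critic) heads."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        # shared representation used by both heads
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)  # logits over discrete actions
        self.value_head = nn.Linear(hidden, 1)           # scalar V(s)

    def forward(self, obs):
        features = self.trunk(obs)
        logits = self.policy_head(features)
        value = self.value_head(features).squeeze(-1)
        return torch.distributions.Categorical(logits=logits), value
```

Sharing the trunk saves computation and lets both losses shape a common representation, at the cost of the actor and critic objectives potentially interfering with each other, which is commonly handled by down-weighting the value loss in the combined objective.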

Parallel Workers

This one is pretty straightforward: because we update the critic network in an online fashion, i.e. with samples collected under the most current policy, a single environment gives us tiny, noisy batches, so it's better to feed each update a batch of samples collected by parallel workers executing the same policy in a bunch of parallel environments. There is also a choice between a synchronous and an asynchronous implementation: synchronous workers all wait for the shared update before collecting the next batch, while asynchronous workers don't wait for each other, which is faster at the cost of some updates being computed from a slightly outdated policy (usually acceptable in practice). A small sketch of the synchronous version follows.
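A minimal sketch of the synchronous variant, assuming a Gymnasium-style reset()/step() API; collect_synchronous_batch and actor.act are hypothetical names used only for illustration:

```python
def collect_synchronous_batch(envs, actor, obs_list):
    """One synchronous collection step: every worker acts with the same current
    policy, and the combined transitions form a single training batch."""
    batch = []
    for i, env in enumerate(envs):
        action = actor.act(obs_list[i])  # same, most-recent policy for all workers
        next_obs, reward, terminated, truncated, _ = env.step(action)
        batch.append((obs_list[i], action, reward, next_obs, terminated))
        # restart finished episodes so every worker keeps contributing samples
        obs_list[i] = env.reset()[0] if (terminated or truncated) else next_obs
    return batch, obs_list
```

In the asynchronous variant, each worker would instead push its transitions and pull the latest parameters on its own schedule rather than waiting at this barrier.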

Reduce the variance even more: n-step return and GAE
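Briefly: the bootstrapped one-step target has low variance but is biased by the imperfect critic, while the Monte Carlo target is unbiased but high-variance. The n-step return interpolates between them by summing n real rewards before bootstrapping with the critic, and Generalized Advantage Estimation (GAE) goes one step further by taking an exponentially weighted average of all n-step advantage estimators with decay \lambda:

$$
\delta_t = r_t + \gamma V_\phi^\pi(s_{t+1}) - V_\phi^\pi(s_t), \qquad \hat{A}^{\text{GAE}}_t = \sum_{l \ge 0} (\gamma \lambda)^l \, \delta_{t+l} = \delta_t + \gamma \lambda \, \hat{A}^{\text{GAE}}_{t+1}
$$

A minimal sketch of the standard backward recursion (variable names are illustrative, not from the slides):

```python
import numpy as np

def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Compute A_t = delta_t + gamma * lam * A_{t+1} backwards over one rollout,
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) and terminals cut the recursion."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        next_adv = delta + gamma * lam * nonterminal * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages
```

Setting \lambda = 0 recovers the one-step bootstrapped advantage, and \lambda = 1 recovers the Monte Carlo estimate, so \lambda trades off bias against variance.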

All included screenshots credit to Lecture 6 (slides)
