Inverse Reinforcement Learning

Following up on the suboptimal-actions model introduced in the previous lectures, in this section we define the inverse reinforcement learning (IRL) problem and explain how the probabilistic models we've seen can be used to derive IRL algorithms.

The biggest difference between IRL and the standard RL problem is that the reward function is no longer given to us as a mapping from state-action pairs to scalar rewards; we have to learn it ourselves. There are several motivations for this: learning the intention of an expert can help imitation learning agents reason about action choices, and it lets us avoid explicitly defining a reward for complex real-world tasks. However, many reward functions can explain the same behavior, so to formally define the problem we also specify the form of the reward function, such as a linear combination of features extracted from state-action pairs, or a neural network approximator.
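
To make the setup concrete (this follows the standard formal definition rather than a specific slide): we are given the state and action spaces, samples from an expert policy $\pi^\star$, and possibly the transition dynamics, and we want to learn a reward $r_\psi$ and then use it to recover a policy. Two common parameterizations of $r_\psi$ are

$$
r_\psi(s, a) = \psi^\top f(s, a) \quad \text{(linear in features } f(s,a)\text{)} \qquad \text{or} \qquad r_\psi(s, a) = \text{NN}_\psi(s, a) \quad \text{(neural network)}.
$$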

Feature Matching

A linear reward function means we have feature functions $f$ that extract features from a state-action pair, and we assume the reward is a linear combination of these features. Under this assumption, if we have sampled transitions from an expert operating under an unknown but optimal policy, we can recover a reward function via feature matching:
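
In symbols, the feature matching condition is roughly: with $r_\psi(s, a) = \psi^\top f(s, a)$ and $\pi^{r_\psi}$ an optimal policy under $r_\psi$, find $\psi$ such that

$$
\mathbb{E}_{\pi^{r_\psi}}\left[ f(s, a) \right] = \mathbb{E}_{\pi^\star}\left[ f(s, a) \right],
$$

where the right-hand side is estimated from the expert's sampled transitions.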

As shown above, any reward function $r_\psi$ parametrized by $\psi$ induces an optimal policy $\pi^{r_\psi}$ that maximizes it. To find the best reward function among the candidates, we look for the one whose induced optimal policy yields expected feature values that match the expected features of the expert's samples.

But multiple reward functions might achieve this match, so how do we find the best one? Borrowing from our classification material in ML, we can use the maximum margin principle to find the reward function that does the best job of "distinguishing" the expert policy from all other policies. As with SVMs, we can get rid of the explicit margin value $m$ and instead solve a constrained optimization problem that scales the margin by a policy mismatch metric $D$:
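
In symbols, the max-margin formulation sketched here (notation is approximate, with $f$ the feature map from above) looks like

$$
\min_\psi \ \frac{1}{2} \lVert \psi \rVert^2 \quad \text{s.t.} \quad \psi^\top \mathbb{E}_{\pi^\star}[f(s, a)] \ \ge\ \max_{\pi}\ \psi^\top \mathbb{E}_{\pi}[f(s, a)] + D(\pi, \pi^\star),
$$

so that policies very different from the expert (large $D$) must be out-scored by a correspondingly larger amount.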

This feature matching method has several issues: picking the best reward function by maximizing the margin is somewhat arbitrary; there is no clear model of expert suboptimality; and the messy constrained optimization problem is difficult to solve with deep learning.

Learning the Reward in the Probabilistic Decision-Making Model

Recall from the previous note/lecture that a probabilistic graphical model augmented with optimality variables can model suboptimal behavior, and we've discussed how to do inference in this model. Remember that the optimality nodes are defined through the exponential of the reward, $p(O_t \mid s_t, a_t) \propto \exp(r(s_t, a_t))$. We can start from here to learn a reward function under this model. One way to do so is maximum likelihood learning, which finds the reward parameters $\psi$ by maximizing the likelihood of the sampled expert trajectories under this reward, assuming optimality at every time step. Additionally, we can drop $p(\tau)$, the "physically feasible" trajectory probability (it does not depend on $\psi$), and simplify the average trajectory log-likelihood down to the average trajectory reward minus the log partition function $\log Z$.
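
Written out (the exact notation may differ slightly from the slides), the maximum likelihood objective over $N$ expert trajectories $\tau_1, \dots, \tau_N$ is

$$
\max_\psi \ \frac{1}{N} \sum_{i=1}^{N} \log p(\tau_i \mid O_{1:T}, \psi) \;=\; \max_\psi \ \frac{1}{N} \sum_{i=1}^{N} r_\psi(\tau_i) - \log Z, \qquad Z = \int p(\tau) \exp(r_\psi(\tau)) \, d\tau,
$$

where $r_\psi(\tau) = \sum_t r_\psi(s_t, a_t)$ is the total reward of a trajectory.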

Bear with a bit more math here: the derivation below arrives at a neat interpretation that the gradient of the average log-likelihood, denoted $\mathcal{L}$, is the difference between two expectations of the same quantity (the reward gradient), taken under two different distributions: the expert's trajectory distribution and the soft optimal policy induced by the current reward. From here, we can focus on how to estimate these two terms.
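
The end result of that derivation (stated here without the intermediate steps) is

$$
\nabla_\psi \mathcal{L} \;=\; \mathbb{E}_{\tau \sim \pi^\star(\tau)}\left[ \nabla_\psi r_\psi(\tau) \right] \;-\; \mathbb{E}_{\tau \sim p(\tau \mid O_{1:T}, \psi)}\left[ \nabla_\psi r_\psi(\tau) \right],
$$

i.e. the expected reward gradient under the expert's trajectory distribution minus the same quantity under the soft optimal policy induced by the current reward.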

Max-Entropy Algorithm

One way to approximate the second term in the gradient above is to use the backward and forward messages introduced in the previous note/lecture and combine them into an estimate of the state-action visitation probability (denoted by $\mu$ below). This is the MaxEnt IRL algorithm:
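
Below is a minimal sketch of one such gradient estimate, assuming a small tabular problem, a linear reward $r_\psi(s, a) = \psi^\top f(s, a)$, and backward/forward messages already computed as in the previous note; the function and array names are illustrative, not taken from the slides.

```python
import numpy as np

def maxent_irl_gradient(features, expert_sa, backward, forward):
    """Estimate the MaxEnt IRL gradient for a linear reward r_psi(s,a) = psi^T f(s,a),
    whose gradient with respect to psi is simply the feature vector f(s, a).

    features:  [S, A, K] array of feature vectors f(s, a)
    expert_sa: list of (s, a) index pairs sampled from the expert
    backward:  [T, S, A] backward messages beta_t(s, a)
    forward:   [T, S]    forward messages alpha_t(s)
    """
    T = backward.shape[0]

    # Expert term: average feature vector over the sampled expert state-actions.
    expert_term = np.mean([features[s, a] for s, a in expert_sa], axis=0)

    # Soft-optimal term: state-action visitation probabilities mu_t(s, a),
    # proportional to beta_t(s, a) * alpha_t(s), normalized at each time step.
    soft_term = np.zeros(features.shape[-1])
    for t in range(T):
        mu = backward[t] * forward[t][:, None]        # [S, A]
        mu /= mu.sum()
        soft_term += (mu[..., None] * features).sum(axis=(0, 1))
    soft_term /= T  # per-step average; the slides sum over t, a constant factor

    # Gradient of the average log-likelihood: expert expectation minus
    # soft-optimal-policy expectation of the reward gradient.
    return expert_term - soft_term
```

The reward parameters would then be updated with a step $\psi \leftarrow \psi + \eta \nabla_\psi \mathcal{L}$ and the messages recomputed under the new reward.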

The reason for the name "MaxEnt" is that, with a linear reward function, the algorithm can be shown to find the maximum-entropy policy that satisfies the feature matching constraint discussed in the section above.
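
Concretely, the claim (stated without proof; this is the classic MaxEnt IRL result of Ziebart et al.) is that the recovered policy solves

$$
\max_{\pi} \ \mathcal{H}(\pi) \quad \text{s.t.} \quad \mathbb{E}_{\pi}[f(s, a)] = \mathbb{E}_{\pi^\star}[f(s, a)],
$$

i.e. among all policies that match the expert's feature expectations, it commits to nothing beyond what the features require.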

Importance Sampling

MaxEnt IRL requires enumerating all state-action tuples during the gradient update step, so it does not scale to large or continuous state and action spaces, and it requires known transition dynamics, which we don't always have when we only see sampled transitions. Recall that importance sampling lets us correct for a distribution mismatch when sampling under a different policy; using IS, we get an alternative, sample-based update for IRL:
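
Concretely (up to notation), the idea is: sample trajectories $\tau_j$ from the current policy $\pi_\theta$ and approximate the second expectation with a self-normalized importance-sampled average,

$$
\mathbb{E}_{\tau \sim p(\tau \mid O_{1:T}, \psi)}\left[ \nabla_\psi r_\psi(\tau) \right] \;\approx\; \frac{1}{\sum_j w_j} \sum_j w_j \, \nabla_\psi r_\psi(\tau_j), \qquad w_j = \frac{\exp(r_\psi(\tau_j))}{\pi_\theta(\tau_j)} = \frac{\exp\!\left( \sum_t r_\psi(s_t, a_t) \right)}{\prod_t \pi_\theta(a_t \mid s_t)},
$$

where the dynamics terms in the numerator and denominator cancel out.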

So we can keep the same reward gradient estimated from sampled trajectories and only add a re-weighting step that adjusts the second term toward the soft optimal policy induced by the current reward function. The importance weights $w_j$ can be expanded and simplified, as shown above, into a ratio between the exponentiated reward of a trajectory under the current reward function and its probability under the current policy's action choices. Guided cost learning uses this idea and additionally uses a neural network to approximate and learn the reward function.
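
As a minimal numerical sketch of this re-weighted gradient (a simplification that assumes per-trajectory reward gradients and log-probabilities are already computed; guided cost learning itself uses a neural reward network and interleaves policy updates, which are not shown):

```python
import numpy as np

def is_reward_gradient(grad_r_expert, grad_r_sampled, reward_sampled, logprob_sampled):
    """Importance-sampled estimate of the IRL reward gradient.

    grad_r_expert:   [N, K] gradient of sum_t r_psi for each expert trajectory
    grad_r_sampled:  [M, K] gradient of sum_t r_psi for each policy-sampled trajectory
    reward_sampled:  [M]    sum_t r_psi(s_t, a_t) for each sampled trajectory
    logprob_sampled: [M]    sum_t log pi_theta(a_t | s_t) for each sampled trajectory
    """
    # Expert term: plain average over expert trajectories.
    expert_term = grad_r_expert.mean(axis=0)

    # Importance weights w_j = exp(r_psi(tau_j)) / pi_theta(tau_j), computed in
    # log space for numerical stability, then self-normalized.
    log_w = reward_sampled - logprob_sampled
    w = np.exp(log_w - log_w.max())
    w /= w.sum()

    # Soft-optimal term: weighted average of the sampled-trajectory gradients.
    soft_term = (w[:, None] * grad_r_sampled).sum(axis=0)

    return expert_term - soft_term
```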

IRL as Generative Adversarial Networks

We can look at the above procedure from an adversarial viewpoint: one part of the algorithm tries to improve the learned policy to close the gap between the expert and the learned behavior, while the other part updates the reward function to distinguish between the two. This coincides with the idea of Generative Adversarial Networks (GANs).

Put in the generator-discriminator framework, we can write a discriminator loss for the guided cost learning procedure above, or simply use a standard binary neural network classifier as the discriminator (detailed explanation omitted for now).
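
For reference, the discriminator typically used in this GAN view of guided cost learning (notation is approximate) and the corresponding cross-entropy loss are

$$
D_\psi(\tau) \;=\; \frac{\tfrac{1}{Z} \exp(r_\psi(\tau))}{\tfrac{1}{Z} \exp(r_\psi(\tau)) + \pi_\theta(\tau)}, \qquad
\mathcal{L}_{\text{disc}} \;=\; -\,\mathbb{E}_{\tau \sim \pi^\star}\left[ \log D_\psi(\tau) \right] \;-\; \mathbb{E}_{\tau \sim \pi_\theta}\left[ \log\left( 1 - D_\psi(\tau) \right) \right],
$$

while the simpler alternative just trains an ordinary binary classifier on expert versus policy trajectories with the same loss.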

All included screenshots are credited to Lecture 15 (slides)
