Control as Inference

Probabilistic Model for Behaviors

Key idea: the probability of a trajectory, conditioned on acting optimally, is proportional to its probability under the dynamics times the exponential of its total reward.
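As a rough sketch of the model in the screenshots (I'm assuming the standard control-as-inference notation here, with binary optimality variables $\mathcal{O}_t$ whose likelihood is the exponentiated reward):

$$
p(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp\big(r(s_t, a_t)\big)
\qquad\Rightarrow\qquad
p(\tau \mid \mathcal{O}_{1:T}) \propto p(\tau)\, \exp\Big(\sum_{t} r(s_t, a_t)\Big)
$$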

Below, let's first look at three quantities we can compute/infer from this model, and then use them to show how inference in this model can be formulated as an RL problem.

Backward Messages

We can think of computing backward messages as inferring the probability of optimality from the current state and action onward, i.e. the probability that all remaining optimality variables equal 1. After expanding this definition we can separate the conditional probability into a product of several probabilities, and notice that the message conditioned on only a state can be written as an expectation over actions. The math can be a little tricky to wrap your head around, but this derivation enables us to calculate backward messages recursively:
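As a rough sketch of the recursion (assuming the standard notation $\beta_t(s_t, a_t) = p(\mathcal{O}_{t:T} \mid s_t, a_t)$ and $\beta_t(s_t) = p(\mathcal{O}_{t:T} \mid s_t)$, computed backwards from $t = T$):

$$
\beta_t(s_t, a_t) = p(\mathcal{O}_t \mid s_t, a_t)\; \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[ \beta_{t+1}(s_{t+1}) \big],
\qquad
\beta_t(s_t) = \mathbb{E}_{a_t \sim p(a_t \mid s_t)}\big[ \beta_t(s_t, a_t) \big]
$$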

I'm omitting some more math here that basically shows the action prior distribution doesn't affect the formulation, because it can always be folded into the reward. So above we assumed a uniform action prior without loss of generality.
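Roughly, the omitted step shows that any non-uniform action prior can be absorbed into a modified reward (a sketch in my notation):

$$
\tilde{r}(s_t, a_t) = r(s_t, a_t) + \log p(a_t \mid s_t)
$$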

Optimal Policy
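As a sketch of what this section's screenshot shows (with the uniform action prior assumed above), the optimal policy comes out as the ratio of the two backward messages:

$$
\pi(a_t \mid s_t) = p(a_t \mid s_t, \mathcal{O}_{t:T}) = \frac{\beta_t(s_t, a_t)}{\beta_t(s_t)}
$$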

Forward Message

Defined as the probability of a state given optimality up to (but not including) the current time step, it can be expanded out, again using the chain rule of conditional probability, so that it can be calculated recursively from the initial state. The first set of long equations below essentially shows how we can use known quantities to calculate the forward message; and using both forward and backward messages, we can actually calculate the probability of a state under overall optimality (i.e. the state marginals), which is proportional to the two messages multiplied together:
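As a rough sketch, writing the forward message as $\alpha_t(s_t) = p(s_t \mid \mathcal{O}_{1:t-1})$ (with $\alpha_1(s_1) = p(s_1)$), the recursion and the resulting state marginal look like:

$$
\alpha_t(s_t) = \int p(s_t \mid s_{t-1}, a_{t-1})\, p(a_{t-1} \mid s_{t-1}, \mathcal{O}_{t-1})\, \alpha_{t-1}(s_{t-1})\, \mathrm{d}s_{t-1}\, \mathrm{d}a_{t-1},
\qquad
p(s_t \mid \mathcal{O}_{1:T}) \propto \beta_t(s_t)\, \alpha_t(s_t)
$$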

Probabilistic RL

To begin with, let's see why and how variational inference, as introduced in the previous lecture/note, can help us recover the optimal policy under the new model discussed above.

Recall how we've been setting the optimality variables to all true and treating them as given evidence when we calculate posterior action or state probabilities. But while this "evidence" allows us to calculate the best action under optimality, it also changes the state transition dynamics:
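Concretely (a sketch in the notation above), conditioning on optimality tilts the dynamics toward next states with larger backward messages, i.e. higher expected future reward:

$$
p(s_{t+1} \mid s_t, a_t, \mathcal{O}_{1:T}) \propto p(s_{t+1} \mid s_t, a_t)\, \beta_{t+1}(s_{t+1}) \;\neq\; p(s_{t+1} \mid s_t, a_t)
$$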

This makes sense from an inference perspective: given that you acted optimally, high-reward next states are more likely to have come up. But it's not what we want for control: we want to select the best actions while keeping the state transition dynamics the same. Recalling the idea of variational inference, we can learn a distribution that approximates this posterior.

Notice how this q-distribution is supposed to do two things: it approximates the posterior probability of a trajectory under optimality, while, when conditioned on a current state-action pair, it recovers the original transition dynamics. To achieve both, we choose the following form for it:
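A sketch of that form: keep the true initial state distribution and dynamics, and only learn the per-step action distribution $q(a_t \mid s_t)$:

$$
q(s_{1:T}, a_{1:T}) = p(s_1) \prod_{t} q(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)
$$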

So now we can draw a new transition model as below. Notice how the q distribution preserves the transition dynamics of the original optimality model, but lets us omit the script-O nodes, because optimality is already baked into the action distribution, which now plays the role of the optimal policy.
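Roughly (a sketch of the derivation shown in the lecture, which is where the "soft value iteration" mentioned next comes from): plugging this q into the variational lower bound gives an objective of expected reward plus action entropy, and the corresponding backup replaces the hard max over actions with a log-sum-exp (a "soft" max):

$$
\max_q \sum_{t} \mathbb{E}_{(s_t, a_t) \sim q}\big[ r(s_t, a_t) + \mathcal{H}\big(q(a_t \mid s_t)\big) \big]
$$

$$
Q(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[ V(s_{t+1}) \big],
\qquad
V(s_t) = \log \int \exp\big( Q(s_t, a_t) \big)\, \mathrm{d}a_t
$$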

Furthermore, this soft value iteration has variants that allow discounted expected V's and an explicit temperature that weights the V-function towards a hard max, to control the stochasticity as desired:
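A sketch of those variants, with discount $\gamma$ and temperature $\alpha$ (unrelated to the forward message $\alpha_t$ above); as $\alpha \to 0$ the soft max approaches a hard max over actions:

$$
Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\big[ V(s_{t+1}) \big],
\qquad
V(s_t) = \alpha \log \int \exp\big( Q(s_t, a_t) / \alpha \big)\, \mathrm{d}a_t
$$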

I'll stop the notes here for now, but the rest of this lecture covers modified RL algorithms with soft optimality added to the original RL objectives, and stochastic models for learning control.

All included screenshots credit to Lecture 14
