Model-based Planning and Model-based Predictive Control
Recall our state-transition model from the intro to RL part earlier in the course:
Up until now, we've been learning about model-free methods, i.e. making action decisions based only on state observations, while staying agnostic about exactly how state-action pairs transition into one another. But arguably it's still nice to learn these transition dynamics: even if it's not immediately obvious why knowing them helps us learn an RL policy, it's at least intuitive that knowing how states transition under actions lets us do open-loop planning for a task before ever interacting with the environment. So as an intro to model-based RL, let's look at these open-loop, model-based planning methods first.
In the open-loop setting, control and planning are the same thing: we find a sequence of actions at the initial state of an episode/trajectory, then execute it with our eyes closed. Also note that the deterministic vs. stochastic cases here (not to be confused with deterministic vs. stochastic policies) differ in their assumptions about the transition dynamics: even if we know the transition distribution (the p(x) in the figure above), in a stochastic MDP there is still uncertainty about exactly which next state we will land in.
We are choosing a sequence of actions (either $a_1, \dots, a_T$ written out for each timestep, or $A$ written to denote the whole sequence).
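Concretely, the open-loop planning problem can be written as follows (a sketch of the formulation, using the usual $s, a, r$ notation rather than the figure's exact symbols):

$$
a_1, \dots, a_T \;=\; \arg\max_{a_1, \dots, a_T} \; \mathbb{E}\!\left[\sum_{t=1}^{T} r(s_t, a_t) \,\middle|\, a_1, \dots, a_T\right],
\qquad s_{t+1} \sim p(s_{t+1} \mid s_t, a_t),
$$

which in the deterministic case reduces to maximizing $\sum_t r(s_t, a_t)$ subject to $s_{t+1} = f(s_t, a_t)$.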
Simply sample many candidate action sequences $A_i$ at random (e.g. uniformly), evaluate each one, and choose the best $A^\star = \arg\max_{A_i} J(A_i)$ for the objective $J$. Sampling candidate actions gives decent coverage of the space of possible actions, and the evaluations can easily be done in parallel.
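A minimal sketch of random shooting, assuming a learned dynamics model `dynamics(s, a)` and a reward function `reward(s, a)` are available as plain Python callables (both names are placeholders, not anything from the lecture):

```python
import numpy as np

def random_shooting(dynamics, reward, s0, horizon=20, n_candidates=1000, action_dim=2):
    """Sample random action sequences, roll each out through the (learned)
    dynamics model, and return the sequence with the highest predicted return."""
    # Candidate action sequences A_i, sampled uniformly in [-1, 1]
    candidates = np.random.uniform(-1, 1, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, actions in enumerate(candidates):
        s = s0
        for a in actions:
            returns[i] += reward(s, a)
            s = dynamics(s, a)              # predicted next state
    return candidates[np.argmax(returns)]   # best action sequence A*
```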
Instead of sampling from a fixed distribution like Random Shooting does, CEM updates that distribution using the sampled action sequences that yield the highest reward, i.e. the 'elites'. CEM keeps the advantages of Random Shooting and additionally improves the sampling process, but although both methods are simple and parallelizable, they suffer when the action space is high-dimensional (random sampling won't provide sufficient coverage) or when transitions are stochastic (open-loop execution doesn't adjust actions based on environment feedback).
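A sketch of CEM over action sequences, assuming a placeholder `evaluate(A)` that returns the model-predicted return of a sequence (e.g. a rollout like the one in the random-shooting sketch above):

```python
import numpy as np

def cross_entropy_method(evaluate, horizon=20, action_dim=2,
                         n_samples=500, n_elites=50, n_iters=5):
    """Iteratively refit a Gaussian over action sequences to the elites."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # 1) sample candidate sequences from the current Gaussian
        samples = mean + std * np.random.randn(n_samples, horizon, action_dim)
        # 2) evaluate and keep the elites (highest predicted reward)
        scores = np.array([evaluate(A) for A in samples])
        elites = samples[np.argsort(scores)[-n_elites:]]
        # 3) refit the sampling distribution to the elites
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # the final mean as the chosen action sequence
```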
The intuition here is that an action is better than the others if, after we choose it first and then follow up with some default, possibly suboptimal/random policy, the resulting action sequence leads to a higher reward. So MCTS does a shallow tree search to try different actions in the first few steps, then executes a default policy until the end, to gather an evaluation for each early-action branch. Naturally this works best for discrete action spaces. The sketched algorithm isn't too hard to follow, but note the difference between TreePolicy and DefaultPolicy. The key insight is that TreePolicy (which chooses actions inside the tree) is guided by a Score value that depends on both the accumulated reward and the number of times a node has been visited, which provides good exploration over actions.
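For reference, the node-selection score is typically a UCT-style rule, which (up to the choice of constants) looks like:

$$
\text{Score}(s_t) \;=\; \frac{Q(s_t)}{N(s_t)} \;+\; 2C\,\sqrt{\frac{2\ln N(s_{t-1})}{N(s_t)}}
$$

where $Q$ is the reward accumulated at a node, $N$ is its visit count, and $C$ trades off exploitation (first term) against exploration of rarely visited branches (second term).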
In LQR, we model the deterministic transition dynamics as linear and the cost/reward function as quadratic. A fair amount of algebra yields a nice recursive solution to the problem; you can watch the lecture for the details (LQR was also covered in EECS127).
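To make the setup concrete (a sketch in the lecture's $x_t$/$u_t$ notation, not the full derivation), LQR assumes

$$
x_{t+1} = f(x_t, u_t) = F_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t,
\qquad
c(x_t, u_t) = \frac{1}{2} \begin{bmatrix} x_t \\ u_t \end{bmatrix}^{\top} C_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + \begin{bmatrix} x_t \\ u_t \end{bmatrix}^{\top} c_t,
$$

and the recursive solution is a time-varying linear controller $u_t = K_t x_t + k_t$, with $K_t, k_t$ computed from $C_t, F_t$ by sweeping backwards from $t = T$ to $t = 1$.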
We can actually extend LQR to stochastic dynamics: here the next state $x_{t+1}$ comes from a Gaussian distribution whose mean is given by $f$. The solution for this Gaussian transition is the same as in deterministic LQR, regardless of the covariance matrix. And if we want a closed-loop version of stochastic LQR, the solution is again the same, only with:
Note that all the methods above fall under the "shooting method" category, meaning we optimize the objective over action choices only; there is another line of methods, called "collocation methods", which optimizes over both states and actions.
All included screenshots credit to Lecture 10
Continuing from the previous part, the basic recipe for model-based RL is similar to system identification: learn a transition model, so that we can plan through it and find the best action sequences/policy. Skipping v0.5 and v1.0, I'll try to explain the v1.5 model-based algorithm that actually can work:
The biggest difference between the above and model-free algorithms is obviously the supervised step, where we fit a dynamics model that predicts state-action transitions. Note the inner arrow: only the first action of the plan from step #3 is executed in step #4, and the inner iteration replans before every single step. This is to avoid distribution shift under stochastic dynamics: since we don't know exactly which state comes next, actions planned for later timesteps might no longer be optimal. Also notice the outer arrow: to keep our transition model relevant as we execute actions and explore more, every once in a while we still need to loop back and refit the model on the newly collected data. Technically, only step #4 is called MPC: we are literally doing control using our transition model's predictions.
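A minimal sketch of this loop, assuming hypothetical helpers `fit_dynamics` (supervised fit of $s' \approx f(s,a)$), `plan` (e.g. random shooting or CEM from the previous section), and `base_policy`, plus a classic Gym-style `env.step` interface; none of these names come from the lecture:

```python
def model_based_rl_v15(env, fit_dynamics, plan, base_policy,
                       n_outer=10, steps_per_outer=1000):
    """Sketch of the 'v1.5' MBRL loop with MPC."""
    dataset = []
    # collect an initial dataset with the base (e.g. random) policy
    s = env.reset()
    for _ in range(steps_per_outer):
        a = base_policy(s)
        s_next, r, done, _ = env.step(a)
        dataset.append((s, a, s_next))
        s = env.reset() if done else s_next

    for _ in range(n_outer):                 # outer loop: refit the model
        model = fit_dynamics(dataset)        # supervised learning on all data so far
        s = env.reset()
        for _ in range(steps_per_outer):     # inner loop: MPC
            actions = plan(model, s)         # plan a whole action sequence...
            a = actions[0]                   # ...but execute only the first action
            s_next, r, done, _ = env.step(a)
            dataset.append((s, a, s_next))   # grow the dataset, then replan next step
            s = env.reset() if done else s_next
    return model
```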
We can use the planning methods discussed in the previous section (Lecture 10) for step #3, whichever is least expensive. Since we replan at every single step, the individual action sequences we propose don't need to be perfect, and often random shooting or CEM works well.
One problem with the v1.5 algorithm is that the neural net tends to overfit to the dynamics it has seen, and therefore produces very inaccurate expected rewards for unseen ones.
One approach to this issue is to incorporate uncertainty into our models. Using the output entropy won't work, because if the model overfits, its very inaccurate output can still have low entropy. Our problem isn't aleatoric or statistical uncertainty (the model is good, but the data itself is noisy, so predictions are uncertain) but epistemic or model uncertainty (the model is certain, but we are uncertain about how good the model is, as in the graph above). Mathematically, model uncertainty can be written as the probability of the model's parameters conditioned on a given dataset:
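In symbols (a sketch, reusing the transition notation from earlier): we want the posterior over parameters, and predictions should average over it:

$$
p(\theta \mid \mathcal{D}),
\qquad
p(s_{t+1} \mid s_t, a_t, \mathcal{D}) = \int p_\theta(s_{t+1} \mid s_t, a_t)\, p(\theta \mid \mathcal{D})\, d\theta .
$$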
Bayesian Neural Networks are one way to approximate this conditional distribution: instead of scalar weights, each connection between neurons carries a distribution (e.g. a Gaussian), so we can approximate the overall parameter posterior by multiplying the per-connection distributions:
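In its simplest (fully factorized, mean-field) form, the approximation is:

$$
p(\theta \mid \mathcal{D}) \;\approx\; \prod_i p(\theta_i \mid \mathcal{D}),
\qquad
p(\theta_i \mid \mathcal{D}) = \mathcal{N}(\mu_i, \sigma_i^2).
$$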
Bootstrap ensembles are another attack on the same problem: simply train an ensemble of multiple models, and measure uncertainty by how much the models agree or disagree on the same data. Ideally we'd want these ensembled models to be trained independently on different datasets, hence each model's data is drawn from the original dataset with replacement; in deep learning this usually isn't necessary, because SGD and random initialization already make the models sufficiently independent.
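A sketch of ensemble training, assuming a placeholder `fit_model(data)` that fits one dynamics model on a list of `(s, a, s')` transitions:

```python
import numpy as np

def train_bootstrap_ensemble(dataset, fit_model, n_models=5, bootstrap=True):
    """Train an ensemble of dynamics models, optionally on bootstrapped data.
    In deep RL the resampling is often skipped, since SGD + random init
    already decorrelate the models enough."""
    models = []
    n = len(dataset)
    for _ in range(n_models):
        if bootstrap:
            idx = np.random.randint(0, n, size=n)   # sample with replacement
            data = [dataset[i] for i in idx]
        else:
            data = dataset
        models.append(fit_model(data))
    return models
```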
Then we can use the ensemble to plan with uncertainty: instead of picking the action sequence that maximizes reward under a single model, for each candidate sequence we sample next states from the different models' transition distributions, and average the resulting rewards across all of them.
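A sketch of this uncertainty-aware evaluation, assuming each ensemble member is a callable `model(s, a)` that predicts (or samples) the next state and `reward(s, a)` is given:

```python
import numpy as np

def evaluate_with_ensemble(models, reward, s0, actions):
    """Score a candidate action sequence by rolling it out under every model
    in the ensemble and averaging the total rewards."""
    returns = []
    for model in models:
        s, total = s0, 0.0
        for a in actions:
            total += reward(s, a)
            s = model(s, a)          # next state under this particular model
        returns.append(total)
    return np.mean(returns)          # J(A) averaged over model uncertainty
```

This can be plugged in directly as the `evaluate` function of the random-shooting or CEM planners sketched earlier.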
When we only have high-dimensional observations $o_t$ rather than the true state, we can learn an encoder and plan in its latent space. For now we introduce the single-step encoder for the deterministic case: we simply replace the state $s_t$ with the encoded observation $g_\psi(o_t)$:
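Written out (a sketch with assumed notation: $g_\psi$ is the encoder, $p_\phi$ the latent dynamics model), the training objective becomes roughly

$$
\max_{\phi, \psi} \; \sum_{t} \log p_\phi\big(g_\psi(o_{t+1}) \,\big|\, g_\psi(o_t), a_t\big),
$$

usually together with a similar log-likelihood term for a reward model in the latent space.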
All included screenshots credit to Lecture 11