Model-based Policy Learning

The open-loop style of planning and MPC discussed before has a big problem: the agent must commit to a whole sequence of actions based solely on an initial state. Even with an accurate world transition model, most transitions are still stochastic, so starting from the same state, the realized next states can deviate substantially from the trajectory the agent predicted and planned from.

Therefore we still want to learn a policy rather than a plan, so that actions can be chosen from the most recent state observation. But making this one-step, state-conditioned action decision is exactly what model-free algorithms already do, so in this section the main goal of model-based RL is to use a learned transition model to help improve policy learning. One idea is to use the transition model to generate synthetic samples, then train derivative-free (“model-free”) RL algorithms on them. This “model-based acceleration” approach might seem a little backwards, but it works well.

Dyna and Dyna-style Algorithms

Classic Dyna

Steps 1~4 are essentially online Q-learning, except that in addition to fitting a Q function, the algorithm also learns a transition model from the sampled experience. The trick is in steps 5~7: it samples only state-action pairs (not full transitions, since no next state is drawn from the buffer), uses the world model (p and r) to predict the next state and reward, and fits the Q function on these predicted transitions.
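
A minimal tabular sketch of this loop, assuming a small discrete `env` with hashable states and a `reset()`/`step()` interface; the environment interface and all hyperparameters here are illustrative, not from the lecture:

```python
import random
from collections import defaultdict
import numpy as np

def dyna_q(env, n_actions, episodes=200, alpha=0.1, gamma=0.99,
           epsilon=0.1, n_planning=10):
    """Classic Dyna-Q sketch: online Q-learning plus model-based planning updates."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    model = {}  # learned (here: tabular, deterministic) model: (s, a) -> (r, s')

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # steps 1-3: act epsilon-greedily, observe a real transition
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)

            # step 4: Q-learning update on the real transition, and fit the model
            Q[s][a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s][a])
            model[(s, a)] = (r, s_next)

            # steps 5-7: sample previously seen (s, a) pairs, let the model
            # predict reward and next state, and update Q on those predictions
            for _ in range(n_planning):
                sp, ap = random.choice(list(model.keys()))
                rp, sp_next = model[(sp, ap)]
                Q[sp][ap] += alpha * (rp + gamma * np.max(Q[sp_next]) - Q[sp][ap])

            s = s_next
    return Q
```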

Dyna Style

As we can see, these more recent Dyna-style algorithms also use the learned model to predict transitions, but the transition model and the Q function are updated in mini-batch style. Algorithms like Model-Based Acceleration (MBA), Model-Based Value Expansion (MVE), and Model-Based Policy Optimization (MBPO) all share this idea; a rough sketch of such an update follows the question below.

  • Why do the new transitions generated by the world model not overlap too much with the actual data in the buffer?
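
Here is a rough, MBPO-flavored sketch of one such mini-batch update. The `real_buffer`, `model_buffer`, `model`, `agent`, and `policy` objects are hypothetical placeholders (buffers return tuples of arrays, `model.predict` returns predicted next states and rewards), and the rollout length, batch sizes, and real-to-model data ratio are illustrative:

```python
def dyna_style_update(real_buffer, model_buffer, model, agent, policy,
                      n_rollouts=400, rollout_len=5,
                      model_batch=256, q_batch=256, real_frac=0.1):
    # 1) fit the dynamics/reward model on a mini-batch of real transitions
    s, a, r, s_next = real_buffer.sample(model_batch)
    model.train_step(s, a, r, s_next)

    # 2) branch short synthetic rollouts from states already seen in the real buffer
    states, _, _, _ = real_buffer.sample(n_rollouts)
    for _ in range(rollout_len):
        actions = policy(states)
        next_states, rewards = model.predict(states, actions)
        model_buffer.add(states, actions, rewards, next_states)
        states = next_states

    # 3) update the Q function (and policy) on a mix of real and model-generated data
    n_real = int(real_frac * q_batch)
    agent.q_update(real_buffer.sample(n_real))
    agent.q_update(model_buffer.sample(q_batch - n_real))
```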

Policies Simpler than Neural Nets

Distillation

For a classification task, train a single model not on hard (deterministic) labels, but on the soft targets output by an ensemble of models.
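
A minimal PyTorch sketch of that soft-target loss, assuming the teachers' temperature-softened predictions are averaged into the target; the temperature, batch size, and class count are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, T=2.0):
    """Cross-entropy against the temperature-softened average of an ensemble."""
    teacher_probs = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)                                      # soft targets
    log_student = F.log_softmax(student_logits / T, dim=-1)
    return -(teacher_probs * log_student).sum(dim=-1).mean()

# usage with stand-in logits: batch of 32, 10 classes, 5 teachers
student_logits = torch.randn(32, 10, requires_grad=True)
teacher_logits_list = [torch.randn(32, 10) for _ in range(5)]
loss = distillation_loss(student_logits, teacher_logits_list)
loss.backward()
```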

We can use this idea for policy distillation: supervised learning to improve multi-task learning, where the single policy is trained not on hard action labels but on weighted (soft) action targets coming from the per-task policies.
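
A hypothetical multi-task policy distillation step in the same spirit: one student policy is trained by supervised learning to match the (weighted) action distributions of per-task teacher policies, using the same soft-target loss as in the classification case above. The network sizes, task weights, and data format are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, n_tasks = 8, 4, 3
student = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
teachers = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
            for _ in range(n_tasks)]          # stand-ins for trained per-task policies
task_weights = [1.0, 1.0, 1.0]                # how much each task contributes to the loss
opt = torch.optim.Adam(student.parameters(), lr=3e-4)

def distill_step(task_obs_batches):
    """task_obs_batches[i]: a batch of observations collected on task i."""
    loss = 0.0
    for i, obs in enumerate(task_obs_batches):
        with torch.no_grad():
            teacher_probs = F.softmax(teachers[i](obs), dim=-1)   # soft action targets
        log_student = F.log_softmax(student(obs), dim=-1)
        loss = loss + task_weights[i] * (-(teacher_probs * log_student).sum(-1).mean())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)

# usage with random stand-in observations for each task
distill_step([torch.randn(32, obs_dim) for _ in range(n_tasks)])
```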

Guided policy search learns local LQR policies starting from different initial states, then fits a global policy to them by supervised learning and uses the global policy to adjust the local policies' rewards, so the local policies improve again while staying close to the global one. Divide and Conquer RL does the same thing with local policies trained by deep RL algorithms such as TRPO.
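
A toy numpy sketch of this alternation only (not the real algorithm: the local controllers are improved by a crude 1-D search rather than LQR/iLQR, and the constraint that keeps local and global policies close is replaced by a fixed quadratic penalty; the dynamics, cost, and all constants are made up):

```python
import numpy as np

# toy problem: dynamics s_{t+1} = s_t + a_t, cost = s^2 + 0.1 a^2, horizon T
T, penalty = 10, 1.0
init_states = [-2.0, 1.0, 3.0]            # one local controller per initial state
local_gains = [0.5 for _ in init_states]  # local policy: a = -k_i * s
global_gain = 0.5                         # global policy: a = -k_g * s

def rollout(s0, k):
    """Roll out a = -k * s and return the visited states and actions."""
    states, actions, s = [], [], s0
    for _ in range(T):
        a = -k * s
        states.append(s)
        actions.append(a)
        s = s + a
    return np.array(states), np.array(actions)

def local_objective(s0, k, k_global):
    """Trajectory cost plus a penalty for deviating from the global policy."""
    states, actions = rollout(s0, k)
    cost = np.sum(states**2 + 0.1 * actions**2)
    deviation = np.sum((actions - (-k_global * states))**2)
    return cost + penalty * deviation

for _ in range(20):  # alternate local improvement and global supervised fitting
    # 1) improve each local controller against its own adjusted objective
    candidates = np.linspace(0.0, 1.5, 151)
    for i, s0 in enumerate(init_states):
        local_gains[i] = min(candidates, key=lambda k: local_objective(s0, k, global_gain))
    # 2) fit the global policy by least squares on the local controllers' rollouts
    S, A = [], []
    for s0, k in zip(init_states, local_gains):
        states, actions = rollout(s0, k)
        S.append(states)
        A.append(actions)
    S, A = np.concatenate(S), np.concatenate(A)
    global_gain = -np.sum(S * A) / np.sum(S * S)

print("local gains:", local_gains, "global gain:", global_gain)
```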

All included screenshots are credited to Lecture 12
