Transfer and Multi-task Learning
To quote from the lecture: transfer learning is the problem of using experience from one set of tasks for faster learning and better performance on a new task. Specific to RL, we can define each task to be an MDP, and train agents that do well on unseen (i.e. never-trained-on) tasks. Transfer learning in RL is still an active research area with many sub-problems that vary in their task setups and problem formulations, so this section gives a broad overview of where they fit into the big picture. Since I've been pretty interested in this area lately, I'll follow up with more detailed reading notes on the recommended papers in this section; see Reading Notes.
The general framework for forward transfer in RL is to train the agent on one task and aim for good performance on unseen target tasks. The definition of the single source task can be blurry, though: sometimes the agent is trained on samples that are not all from exactly the same task. For example, if you randomize the training process, the underlying MDP changes, so each time the agent is really trained on a slightly different task. But the general agreement in forward transfer is to put most of the effort into the training process and hope it yields good test performance without much interaction with the target tasks.
Finetuning is very popular in supervised deep learning because, as in the example below, many of a deep network's learned/extracted features are meaningful and can be reused to solve different tasks (bird vs. dog? bird vs. cat? etc.).
But in RL, this idea might not transfer directly: when a policy is trained to convergence, it usually becomes fairly deterministic in its actions, which means that when facing a new task, actions that are good for the new task might not be considered at all, hurting the agent's exploration.
Finetuning via MaxEnt RL
One way to deal with this is to pre-train a policy that is random enough (high entropy), so that more actions are still considered during the finetuning process. See the paper Reinforcement Learning with Deep Energy-Based Policies.
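As a concrete illustration, here is a minimal sketch of entropy-regularized pretraining: a REINFORCE-style loss with an entropy bonus so the pretrained policy stays stochastic. The network sizes, the entropy weight alpha, and the discrete-action setup are my own illustrative assumptions, not the setup from the Deep Energy-Based Policies paper (which uses soft Q-learning).

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Policy(nn.Module):
    """Small categorical policy; sizes are arbitrary for illustration."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return Categorical(logits=self.net(obs))

def policy_loss(policy, obs, actions, returns, alpha=0.01):
    """REINFORCE-style loss with an entropy bonus: keeping entropy high means
    more actions remain plausible when we later finetune on a new task."""
    dist = policy(obs)
    log_probs = dist.log_prob(actions)   # (batch,)
    entropy = dist.entropy()             # (batch,)
    # Maximize return + alpha * entropy  ->  minimize the negative.
    return -(log_probs * returns + alpha * entropy).mean()
```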
Finetuning from transferred visual features
See paper DARLA: Improving Zero-Shot Transfer in Reinforcement Learning
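Roughly, the DARLA recipe is to first learn a disentangled visual encoder (a beta-VAE) on source-domain observations, then train the policy on the frozen latent codes so it transfers across visual changes. Below is a hedged sketch of that idea; the architecture, the beta value, and the 64x64 input size are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    """Tiny beta-VAE for 3x64x64 observations (sizes are illustrative)."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),     # 64 -> 32
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),    # 32 -> 16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 2 * latent_dim),  # outputs mu and log-variance
        )
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1),
        )

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(z), mu, logvar

def beta_vae_loss(recon, x, mu, logvar, beta=4.0):
    recon_loss = F.mse_loss(recon, x, reduction="sum") / x.size(0)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    return recon_loss + beta * kl  # beta > 1 pushes toward disentangled latents

# After pretraining, freeze the encoder and train the policy on the latent code
# (e.g. mu) instead of raw pixels; the hope is that the latents stay meaningful
# under the visual changes in the target domain.
```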
With finetuning, the source and target domains are fixed, and we are mainly concerned with how to best apply source skills to the new target. But often we have some knowledge about how the target will differ from the source, so an alternative approach is to design the source/training domain so that the target domain is a natural extension of it, and the agent can do well without even knowing it is facing a new task.
Randomize Dynamics: e.g. physical parameters
When the differences between source and target tasks are mainly in the transition dynamics, such as physical parameters, one thing we can do is to "enrich" the agent's training experience with lots of different parameters, so that it will be robust to all possible parameters in the target tasks. More specifically, there are two ways to do this (a minimal sketch of the first option appears after the paper references below):
train the model to do well on all parameters
See paper EPOpt: Learning Robust Neural Network Policies
explicitly train a recurrent model to predict the parameters
Preparing for the Unknown: Learning a Universal Policy with Online System Identification
Sim-to-Real Transfer of Robotic Control with Dynamics Randomization
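Here is a minimal sketch of the first option: resample physical parameters every episode so the policy is trained against a whole distribution of dynamics. The environment hook set_dynamics, the agent interface, and the parameter ranges are hypothetical stand-ins for whatever simulator and RL algorithm you use; the loop assumes a Gymnasium-style reset/step API.

```python
import numpy as np

def sample_dynamics(rng):
    """Draw one set of physical parameters (ranges are arbitrary examples)."""
    return {
        "mass_scale": rng.uniform(0.5, 1.5),   # scale link masses
        "friction": rng.uniform(0.3, 1.2),     # ground friction coefficient
        "motor_gain": rng.uniform(0.8, 1.2),   # actuator strength
    }

def train_with_randomized_dynamics(make_env, agent, n_episodes=10_000, seed=0):
    rng = np.random.default_rng(seed)
    env = make_env()
    for _ in range(n_episodes):
        # Hypothetical hook into the simulator: new dynamics every episode.
        env.set_dynamics(**sample_dynamics(rng))
        obs, _ = env.reset()
        done = False
        while not done:
            action = agent.act(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            agent.observe(obs, action, reward, next_obs)
            obs = next_obs
            done = terminated or truncated
        agent.update()
```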
Randomize Observations: (mostly for vision-based RL)
Sometimes the underlying dynamics remain the same and the source and target domains differ only in observations (I think of this as different O but the same S). In this case, data augmentation has proven to work really well, with the additional benefit of improving data efficiency.
Data augmentation
One downside of this, in my opinion, is that the randomness from manually augmented data can be limited, both because we can't predict all possible variations in the target domain and because the model might still overfit to the manually augmented data.
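For concreteness, here is a small sketch of typical image-observation augmentations used in vision-based RL: pad-and-random-crop plus brightness jitter, applied to each batch before it reaches the policy/critic. The pad size and jitter strength are illustrative choices, not tied to any particular paper.

```python
import torch
import torch.nn.functional as F

def random_crop(imgs, pad=4):
    """imgs: (B, C, H, W) float tensor. Replication-pad, then crop back to (H, W)."""
    b, c, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(imgs)
    for i in range(b):
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out

def brightness_jitter(imgs, strength=0.2):
    """Scale each image by a random factor in [1 - strength, 1 + strength]."""
    scale = 1.0 + strength * (2 * torch.rand(imgs.size(0), 1, 1, 1) - 1)
    return (imgs * scale).clamp(0.0, 1.0)

def augment(obs_batch):
    """Compose the augmentations; call this on every sampled batch during training."""
    return brightness_jitter(random_crop(obs_batch))
```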
The above methods focus on enhancing the source domain to approximately "contain" the target without interacting with it, so they are still considered zero-shot transfer. Domain adaptation methods, on the other hand, allow the agent to take a look at the target domain and adjust itself to adapt to it.
Domain Adaptation
These are mostly vision-based RL, and unsurprisingly, GANs become a common tool here:
1.a. Adversarial adaptation with a discriminator: see Adapting Visuomotor Representations with Weak Pairwise Constraints (a generic sketch of this variant follows the list)
1.b. Adversarial adaptation with a generator: turn synthetic images into realistic ones, i.e. transform simulation images into real-world-like images before training; see Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping
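Below is a generic sketch of the discriminator-based variant (1.a): a domain classifier tries to tell simulation features from real features, while the encoder is trained to fool it, pulling the two feature distributions together. The network sizes, loss weighting, and update schedule are illustrative assumptions, not taken from any specific paper; in practice the fooling loss is added on top of the usual policy/task loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))      # obs -> feature
discriminator = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 1))  # feature -> sim/real logit

opt_enc = torch.optim.Adam(encoder.parameters(), lr=1e-4)
opt_disc = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def adaptation_step(sim_obs, real_obs, adv_weight=0.1):
    sim_feat, real_feat = encoder(sim_obs), encoder(real_obs)

    # 1) Train the discriminator: simulation features -> label 1, real features -> label 0.
    logits = torch.cat([discriminator(sim_feat.detach()), discriminator(real_feat.detach())])
    labels = torch.cat([torch.ones(len(sim_obs), 1), torch.zeros(len(real_obs), 1)])
    disc_loss = F.binary_cross_entropy_with_logits(logits, labels)
    opt_disc.zero_grad(); disc_loss.backward(); opt_disc.step()

    # 2) Train the encoder to fool the discriminator: make real features look like sim features.
    fool_loss = F.binary_cross_entropy_with_logits(
        discriminator(real_feat), torch.ones(len(real_obs), 1))
    opt_enc.zero_grad(); (adv_weight * fool_loss).backward(); opt_enc.step()
    return disc_loss.item(), fool_loss.item()
```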
Model-based Methods
Model Distillation
Contextual Policy (more in Meta-learning section)
Modular Networks
All included screenshots credit to Lecture 16 (slides)