Paper Reading Notes

  • Overview

    They propose a way to transfer a policy between 2D MuJoCo source and target domains that differ only in their physical parameters. Training alternates between two phases so that (a) the policy becomes robust across all parameters in the source distribution, and (b) the source distribution is adapted to approximate/contain the (unknown) parameters of the target domain.

  • Algorithm Sketch

    Alternating phases during training:

    1. Sample N = 240 parameter settings (models) from the prior P, generate trajectories from each sampled model, and optimize the policy with TRPO to perform well across them; to make the policy “robust”, the update only uses the worst-performing (lowest-return) fraction of trajectories (see the first sketch after this list).

    2. Collect a few trajectories from the target domain and update the prior P so that the source distribution of parameters better approximates the unknown target parameters. I don’t fully follow the probability derivation here, but the idea is importance sampling: compute the likelihood of the target trajectories under each sampled source parameter and reweight the distribution accordingly (second sketch after this list).

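    Below is a minimal, self-contained sketch of phase 1 on a toy 1-D point-mass task. The environment, the Gaussian prior over mass, the clipping range, and the finite-difference step standing in for TRPO are all my own simplifications, not the paper’s setup; only the ensemble sampling and the worst-fraction selection mirror the algorithm.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    N_MODELS = 240   # models sampled from the source distribution per iteration
    EPSILON = 0.1    # fraction of worst-performing models kept for the update
    HORIZON, DT = 50, 0.1


    def rollout_return(mass, gain):
        """Return of a PD controller u = -gain*x - v on a toy 1-D point mass."""
        x, v, ret = 1.0, 0.0, 0.0
        for _ in range(HORIZON):
            u = -gain * x - v          # derivative term fixed for simplicity
            v += DT * u / mass
            x += DT * v
            ret -= x ** 2 + 0.01 * u ** 2
        return ret


    def robust_policy_step(gain, prior_mean, prior_std, lr=1e-3, delta=0.1):
        # sample an ensemble of physical parameters from the source prior
        # (clipped to a plausible range just to keep this toy stable)
        masses = np.clip(rng.normal(prior_mean, prior_std, size=N_MODELS), 0.3, 4.0)
        # evaluate the current policy and a perturbed copy on every sampled model
        rets = np.array([rollout_return(m, gain) for m in masses])
        rets_pert = np.array([rollout_return(m, gain + delta) for m in masses])
        # keep only the worst epsilon-fraction of models ("robustness" step)
        worst = rets <= np.quantile(rets, EPSILON)
        # improve the policy on those worst cases; a crude finite-difference step
        # stands in for the paper's TRPO update on the worst trajectories
        grad = (rets_pert[worst].mean() - rets[worst].mean()) / delta
        return gain + lr * grad


    gain = 1.0
    for _ in range(10):
        gain = robust_policy_step(gain, prior_mean=1.0, prior_std=0.5)
    print(f"robust gain after 10 iterations: {gain:.3f}")
    ```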
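    And a similarly simplified sketch of phase 2: given a handful of target-domain transitions, weight each sampled source parameter by the likelihood it assigns to that data, then refit the prior. The toy dynamics, the Gaussian noise model, and the moment-matching refit are assumptions for illustration, not the paper’s exact update.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    DT, NOISE_STD = 0.1, 0.02
    N_MODELS = 240  # candidate parameter settings sampled from the source prior


    def step(x, u, mass, rng=None):
        """One step of a toy dynamics model; noisy only when an rng is given."""
        noise = NOISE_STD * rng.normal() if rng is not None else 0.0
        return x + DT * u / mass + noise


    # a few "target domain" transitions, generated with an unknown true mass
    TRUE_TARGET_MASS = 2.0
    target, x = [], 1.0
    for _ in range(30):
        u = -x                      # some fixed behaviour policy
        x_next = step(x, u, TRUE_TARGET_MASS, rng)
        target.append((x, u, x_next))
        x = x_next

    # approximate Bayesian update of the source distribution over mass
    prior_mean, prior_std = 1.0, 0.5
    masses = np.clip(rng.normal(prior_mean, prior_std, size=N_MODELS), 0.3, 4.0)

    log_lik = np.zeros(N_MODELS)
    for i, m in enumerate(masses):
        for x, u, x_next in target:
            pred = step(x, u, m)    # deterministic prediction under mass m
            log_lik[i] += -0.5 * ((x_next - pred) / NOISE_STD) ** 2

    # importance weights: likelihood of the target data under each sampled model
    w = np.exp(log_lik - log_lik.max())
    w /= w.sum()

    # refit the source prior to the reweighted samples (moment matching)
    new_mean = np.sum(w * masses)
    new_std = np.sqrt(np.sum(w * (masses - new_mean) ** 2)) + 1e-3
    print(f"prior over mass: N({prior_mean:.2f}, {prior_std:.2f}) "
          f"-> N({new_mean:.2f}, {new_std:.2f}), true mass = {TRUE_TARGET_MASS}")
    ```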
  • Notes

    This method is limited by the assumption that the only difference between the source and target domains is the physical parameters, and it works best when the varying parameters are explicitly modeled in the source distribution. Phase 1 alone (no source-domain adaptation) should be sufficient if the source distribution is “broad” enough that the target is effectively covered; adapting the source with phase 2 is intuitively more expensive, but it works well when the source/target mismatch is still “model-able”. It also makes the method few-shot transfer, since it needs to gather trajectories from the target domain as well.
