One-shot imitation from video with recurrent comparator networks
Glen Berseth
July 15


We would like an agent or software robot living in a simulation to be able to watch a single demonstration of a desired behaviour and replicate that behaviour. Using a new way to formulate spatio-temporal distance functions, we can now train reinforcement-learning-based agents to reproduce expert demonstrations by watching a single video example.

The example above shows an agent that lives in a 2D world, but as we demonstrate below, our method also allows 3D virtual humanoid robots within a physics simulation environment to use our new Visual Imitation for RL technique to learn control policies for complex behaviours.

Visual Imitation with reinforcement learning using recurrent Siamese networks.

Imitation learning, the ability to reproduce the behaviour of others, is a challenging and important problem in AI. Humans and many animals can understand and learn to produce new behaviours simply by observing others. Many current state-of-the-art techniques for imitation learning rely on additional data that is not available in the real world. In mimicking movement, for example, past work has provided torque and other action information in addition to a subject’s observable joint positions.

In our recent work, we describe a learning system that allows 3D simulated robots to reproduce a demonstrated behaviour simply by watching a video of it, without any specialized model that explicitly extracts parameters such as joint positions from the video.

Our approach involves learning a special type of neural network that can compare the raw video of the observed behaviour to the behaviour being produced by the RL agent. This progress could enable us to create robots that learn behaviour by observing humans, and allow humans to instruct robots in a more natural way: “I’ll show you how to do this”.

Reinforcement Learning

In AI, reinforcement learning, or RL, involves training agents to maximize rewards. Specifying what these rewards should be can be a challenging problem. In our work, we specify the reward function using a distance computation, based on a special kind of neural network that compares the agent’s perception of the desired behaviour with the agent’s generated behaviour.
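As a rough illustration of this idea (not the exact reward used in our paper), the snippet below turns the distance between two learned embeddings into a bounded reward; the names distance_reward and scale, and the random stand-in embeddings, are purely illustrative.

```python
import numpy as np

def distance_reward(expert_embedding, agent_embedding, scale=1.0):
    """Turn a learned distance between two embeddings into a bounded reward.

    A small embedding distance (the agent looks like the expert) gives a
    reward near 1; a large distance gives a reward near 0.
    """
    d = np.linalg.norm(expert_embedding - agent_embedding)
    return np.exp(-scale * d)

# Toy usage with random stand-in embeddings; a real system would produce
# these vectors with the comparator network described later in this post.
expert_z = np.random.randn(128)
agent_z = expert_z + 0.1 * np.random.randn(128)  # agent looks close to the expert
print(distance_reward(expert_z, agent_z))
```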

Distance learning

In AI, imitation is often posed as a distribution matching problem in which we want to minimize the distance between what the agent observes (the expert demonstration) and what the agent is doing (its actual behaviour).
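For readers who prefer notation, one common way to write this objective is sketched below; the symbols are generic and not taken from our paper.

```latex
% Generic distribution-matching objective (notation is illustrative):
%   \pi_\theta        the agent's policy
%   \rho_{\pi_\theta} the distribution over behaviour the policy produces
%   \rho_E            the distribution over the expert's demonstrated behaviour
%   D                 a distance or divergence between the two distributions
\min_{\theta} \; D\!\left( \rho_{\pi_\theta} \,\|\, \rho_{E} \right)
```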

If we have access to the expert's actions (the outputs it uses to affect the environment and achieve its goals), we can use supervised learning. However, we rarely have access to such data in the real world. Instead, we use visual perception to observe the expert and have the agent attempt actions until what it reproduces matches what it observed.

Observing and imitating with the limited data available in the real world leads to two challenges: learning a meaningful distance between the agent's behaviour and the expert's given only video data, and enabling the agent to learn the actions necessary to match the expert.

While there has been work on imitation from images for manipulation and 2D robots, addressing 3D imitation from video is an important milestone. Previous methods have made progress on imitation from images by learning a transformation of images such that, in the transformed space, meaningful distances are available. Yet the problem of learning meaningful representations for planning or imitation is far from solved. The challenge is compounded in 3D imitation, as a video carries only limited information about the underlying 3D motion.

A critical aspect of imitation is that a motion has both an ordering and a speed. Walking uses two feet, two knees, and two hips, but you need to move them in the right order and at the right pace.

Imitation from sequences

Current imitation methods use spatial information to compute distances between images. These methods have worked well: given enough time and compute power, good policies can be learned. However, they can suffer from false negatives when the agent is out of sync with the expert.

Imitation from sequences: a walking motion, the same walk at 1/4 speed, and a fallen motion.

In the above example we show a walking motion, followed by the same walking motion played back at 1/4 speed, and lastly a fallen motion. Because of the limitations of spatial distance methods, the latter two examples receive similarly low rewards, even though the middle motion looks far more like a walk than the one on the right.
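To make this failure concrete, here is a toy sketch with a 1-D signal standing in for a pose; the signal and numbers are purely illustrative and not data from our experiments.

```python
import numpy as np

# A 1-D "pose" signal standing in for a motion: one value per frame.
t = np.linspace(0, 4 * np.pi, 100)
walk = np.sin(t)           # the expert's walk
slow_walk = np.sin(t / 4)  # the same walk played back at 1/4 speed
fallen = np.zeros_like(t)  # a motionless, fallen character

def frame_wise_distance(a, b):
    """Average per-frame spatial distance; timing and ordering are ignored."""
    return np.mean(np.abs(a - b))

# The slowed-down walk scores about as badly as (or worse than) the fallen
# character, even though it is clearly the same behaviour at a different speed.
print(frame_wise_distance(walk, slow_walk))
print(frame_wise_distance(walk, fallen))
```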

Our work uses the sequential structure of motion to better inform deep reinforcement learning and help address the limitations of spatial distance methods. Effectively, we learn two distance functions, one in space and another in time.

While the spatial distance function is designed to understand distances between pictures or poses, the time-based distance function understands whether two motions look semantically similar. If the imitation goal is to walk, does the agent's behaviour also look like a walk? In effect, with this new abstraction, we can ask whether this motion looks like a walk, not whether it looks precisely like that walk. This distinction allows us to reward the agent for behaviour that is similar to the expert's but may occur at a different speed or time. This form of reward shaping was critical to learning good policies from video.
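As a rough sketch (again, not the exact reward from our paper), a combined reward could look like the following; the weights w_space and w_time and the function name are illustrative choices.

```python
import numpy as np

def imitation_reward(spatial_d, temporal_d, w_space=0.5, w_time=0.5):
    """Combine a per-frame (spatial) distance and a whole-sequence (temporal)
    distance into a single shaped reward in (0, 1].

    spatial_d:  how different the agent's current frame looks from the expert's
    temporal_d: how different the agent's motion so far looks from the expert's,
                e.g. a distance between recurrent encodings of the two clips
    """
    return np.exp(-(w_space * spatial_d + w_time * temporal_d))

# An agent that is out of phase with the expert may have a large spatial
# distance on this frame but a small temporal distance, because the overall
# motion still "looks like a walk" -- so it still receives a useful reward.
print(imitation_reward(spatial_d=2.0, temporal_d=0.2))
```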

Comparator networks

To learn these distances, we train a recurrent comparator network, called a Siamese network in the literature, on positive and negative video examples. The positive examples are similar or from the same class, and the negative examples are known to be different or from different classes. The model is trained to produce similar encodings when two videos or images show the same behaviour and different encodings otherwise. Additional data from other tasks is included to assist in training the network.

Comparator networks.
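Here is a minimal PyTorch sketch of this kind of training setup: a recurrent Siamese encoder and a contrastive loss over pairs of clips labelled as same or different. The network sizes, the linear layer standing in for a convolutional frame encoder, and the random toy clips are all illustrative assumptions, not the architecture from our paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentComparator(nn.Module):
    """A minimal recurrent Siamese encoder: each video is encoded frame by frame
    and the final hidden state is used as the sequence embedding."""

    def __init__(self, frame_dim=64, hidden_dim=128, embed_dim=32):
        super().__init__()
        # In the real system the per-frame features would come from a conv net
        # over raw pixels; here a linear layer stands in for that encoder.
        self.frame_encoder = nn.Linear(frame_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, embed_dim)

    def forward(self, video):                    # video: (batch, frames, frame_dim)
        feats = torch.relu(self.frame_encoder(video))
        _, h = self.rnn(feats)                   # h: (1, batch, hidden_dim)
        return self.head(h.squeeze(0))           # (batch, embed_dim)

def contrastive_loss(z_a, z_b, same, margin=1.0):
    """Pull embeddings of matching clips together, push non-matching clips apart."""
    d = F.pairwise_distance(z_a, z_b)
    return torch.mean(same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2))

# Toy training step on random stand-in "videos".
model = RecurrentComparator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

clip_a = torch.randn(8, 30, 64)                  # 8 clips, 30 frames each
clip_b = torch.randn(8, 30, 64)
same = torch.randint(0, 2, (8,)).float()         # 1 = same behaviour, 0 = different

loss = contrastive_loss(model(clip_a), model(clip_b), same)
opt.zero_grad()
loss.backward()
opt.step()
```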

Results

Mean reward.

The addition of these new rewards using temporal distances (along with some additional insights) has enabled 3D motion imitation given only a single video demonstration. While these are the first results of their type, additional quality might be gained from multi-view video or other multi-task data.

Imitation learning of 3D motion.

For more, please read our paper on arXiv. The video below shows our approach in action.

This work was produced at Element AI by research intern Glen Berseth and Principal Research Scientist Christopher Pal. Glen is now pursuing postdoctoral research at the University of California, Berkeley with Prof. Sergey Levine in the Berkeley Artificial Intelligence Laboratory.