ADAP

Adaptable Agent Populations via a Generative Model of Policies

in Reinforcement Learning

Kenneth Derek
MIT CSAIL

Phillip Isola
MIT CSAIL

Overview



Image

Learn not one, but many, ways to succeed in an environment - while still using shared weights.




Image

By learning a diverse policy space in an open-ended environment with many solutions, we show that we can adapt to future ablations simply by selecting one of our many learned "species" - without changing parameter weights.


Image

We also learn a policy space in a competitive, two-player, zero-sum game. Here, no single deterministic policy is optimal against all adversaries. In contrast, our ADAP policy space adapts naturally to a wide array of both challenging and naive adversaries.



Overall, we hope to show how it can be beneficial in RL settings to optimize not just for reward, but also for diversity of behavior. As environments continue to increase in complexity and open-endedness -- filled with branching paths to success -- it makes sense to learn not just one, but many, solutions.





The Niche Specialization Experiment (Section 4)

Recall from the paper that in this experiment, we enforce that agents can attack only one of chickens or towers. Agents can initially choose either one, but once they make a choice, they are locked in. Crucially, agents do not observe which object type they are locked into. This means that a single-policy paradigm would result in either a "jack-of-all-trades" agent or an agent that only attacks one particular unit type. By instead learning a policy space, we can find individually specialized policies that, in aggregate, harness all resources.
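
As a rough illustration of this mechanic, here is a minimal sketch in Python (the class and method names are hypothetical, not the actual environment code): an agent's first attack fixes its unit type, and its observation never reveals that type.

    # Hypothetical sketch of the lock-in mechanic: the first attack fixes the
    # agent's unit type, and the observation never exposes which type that is.

    class LockInAgent:
        def __init__(self):
            self.locked_type = None  # "chickens" or "towers"; never shown to the policy

        def attack(self, target_type):
            if self.locked_type is None:
                self.locked_type = target_type       # the first attack locks the agent in
            # Attacking the other unit type has no effect (a "blunder").
            return target_type == self.locked_type

        def observation_extras(self):
            # Deliberately empty: the observation omits the locked type, so a
            # single shared policy cannot condition on which niche it occupies.
            return {}

    agent = LockInAgent()
    print(agent.attack("chickens"))  # True: the agent is now locked into chickens
    print(agent.attack("towers"))    # False: this attack is a blunder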

Visualization

Agents that are locked into towers are boxed in green, and agents that are locked into chickens are boxed in red. If a chicken-specialized agent attacks a tower, or vice versa, we flash the agent in purple and increment the blunder counter. A high number of blunders indicates that the agent population does not specialize well.
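
For concreteness, the blunder count used in these videos amounts to a small helper like the sketch below (illustrative only; names are hypothetical).

    # Illustrative sketch of the blunder metric: count attacks that target a
    # unit type other than the one the agent is locked into.

    def count_blunders(attack_events):
        """attack_events: iterable of (locked_type, attacked_type) pairs from a rollout."""
        return sum(1 for locked, attacked in attack_events
                   if locked is not None and attacked != locked)

    events = [("chickens", "chickens"), ("chickens", "towers"), ("towers", "towers")]
    print(count_blunders(events))  # 1 blunder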

We intend this as a qualitative demonstration of behavior, and are not too concerned with the exact numbers. Still, it is clear that ADAP keeps the number of blunders relatively low. Moreover, in ADAP only a small subset of agents blunder, indicating that it has learned a multi-modal policy space with specialized agents; these erroneous agents can be optimized out. In Vanilla and DIAYN*, all agents tend to blunder with roughly equal probability, and cannot be optimized out.

ADAP

Image

Vanilla

Image

DIAYN*

Image

Adapting to Ablations


Now that we have visualized how ADAP learns a generator with a specialized and multi-modal policy space, we can look at how to use this policy space to adapt to different environmental ablations.

These ADAP policies were not trained to solve these ablations specifically! It turns out that by learning a diverse policy space on the original, open-ended training environment, ADAP learned policy "species" that were implicitly well-suited to most ablations. For example, ADAP learned species that "prefer" to go to the bottom right, or that wait until their health is low before farming food.

Again, we are primarily concerned with qualitative results here; we quantify variance in the paper. We show results after latent-space optimization on each ablated environment.
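
Concretely, adaptation touches only the latent space: the generator's weights stay frozen, and we look for latents whose policies score well on the ablated environment. The sketch below uses a simple random search for illustration; `policy_generator` and `evaluate_return` are placeholders standing in for the trained generator G and an episode-return estimate, and the actual procedure optimizes the latent distribution rather than individual points.

    import numpy as np

    # Hedged sketch of adapting by latent search only, with generator weights frozen.

    def adapt_by_latent_search(policy_generator, evaluate_return, latent_dim=3,
                               num_candidates=64, episodes_per_candidate=5, seed=0):
        """Search for a latent whose policy scores well on the ablated environment."""
        rng = np.random.default_rng(seed)
        best_z, best_return = None, -np.inf
        for _ in range(num_candidates):
            z = rng.standard_normal(latent_dim)      # candidate "species" (latent_dim is illustrative)
            policy = policy_generator(z)             # no weight updates, only a new latent
            avg_return = np.mean([evaluate_return(policy)
                                  for _ in range(episodes_per_candidate)])
            if avg_return > best_return:
                best_z, best_return = z, avg_return
        return best_z, best_return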

ADAP

Image

Vanilla

Image

DIAYN*

Image

ADAP

Image

Vanilla

Image

DIAYN*

Image

ADAP

Image (reduced video quality for memory compression)

Vanilla

Image

DIAYN*

Image

ADAP

Image

Vanilla

Image

DIAYN*

Image

ADAP

Image

Vanilla

Image

DIAYN*

Image

Adapting to Adversaries


ADAP also exhibits robustness to a variety of adversaries in the zero-sum, two-player game of Markov Soccer. Below, we provide a summary of the different policies that ADAP emergently learned during its self-play curriculum.

Strategies for Markov Soccer

Before diving into specific episodes of Markov Soccer, here are some tactics that we have observed ADAP is generally able to find, without ever being explicitly trained against them. Hopefully, these help illustrate the types of policy diversity learned by ADAP, as well as explain ADAP's superior performance relative to the Vanilla baseline.

We indicate possession by shading with a light grey square.

Image

Oscillate 1: B can go up and around the bot A, rather than charging through A's defense.


Image

Stand: B simply needs to move deterministically towards the unguarded square. We've noticed that ADAP tends to be more committed, whereas the Vanilla baseline may attempt some fake-out moves. A more committed approach is better, since it reduces the chance that the game ends in a draw before B scores.


Image

Straight: A will deterministically move right, so if B can stand in front of A, possession will switch (indicated by the star) and B can sneak around A into the goal. In some Vanilla rollouts, B remains stuck behind A rather than sneaking out and around.


With that said, enjoy the sample rollouts below, collected from generators G learned with the ADAP (multiplicative) and Vanilla (multiplicative) methods. We hope they provide insight into the bot's behaviors as well as the behaviors of G. To view a new random episode, just click on the GIF.

Player A is the Bot, and Player B is a sample chosen from the learned policy space.
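
For reference, each rollout is generated roughly as in the sketch below: sample one latent, hold the resulting policy fixed for the whole episode, and play it against the scripted bot. `soccer_env`, `bot_policy`, and `G` are placeholder interfaces, not the actual implementation.

    import numpy as np

    # Sketch of a rollout: player A is a fixed scripted bot, player B is a single
    # policy sampled once from the generator G and held fixed for the episode.

    def soccer_rollout(soccer_env, bot_policy, G, latent_dim=3, seed=0):
        rng = np.random.default_rng(seed)
        z = rng.standard_normal(latent_dim)          # one draw from the latent prior
        policy_b = G(z)                              # one "species" from the policy space
        obs_a, obs_b = soccer_env.reset()
        done, return_b = False, 0.0
        while not done:
            action_a = bot_policy(obs_a)             # scripted bot A
            action_b = policy_b(obs_b)               # sampled policy B
            (obs_a, obs_b), reward_b, done = soccer_env.step(action_a, action_b)
            return_b += reward_b
        return return_b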

ADAP

Image
Sample 0

Vanilla

Image
Sample 0

ADAP

Image
Sample 0

Vanilla

Image
Sample 0

ADAP

Image
Sample 0

Vanilla

Image
Sample 0

ADAP

Image
Sample 0

Vanilla

Image
Sample 0

ADAP

Image
Sample 0

Vanilla

Image
Sample 0

ADAP

Image
Sample 0

Vanilla

Image
Sample 0

Toy Experiments


We couldn't help but include some qualitative results in common toy environment settings. Take a gander at the environments below.

Multi-Goal Evaluation


We present a discrete multi-goal environment: 40 agents start out in the center of the map (gold diamond), and must navigate to one of four gold stars at the corners. There is one small twist: each star only has 10 units of food, and agents consume 1 unit of food each.

  • Agent Observation: one-hot encoding of the agent's (x, y) position
  • Episode length: 20
  • Reward: 0 at timesteps 0-19; 1 at timestep 20 if the agent found food, 0 otherwise (a minimal sketch of this environment follows the list).
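
Below is a minimal sketch of such an environment. The grid size and the five-action move set are assumptions; only the food budget, shared start position, one-hot observation, episode length, and terminal reward follow the description above.

    import numpy as np

    # Sketch of the discrete multi-goal environment (grid size and action set assumed).

    class MultiGoalGrid:
        def __init__(self, size=11, num_agents=40, food_per_goal=10, horizon=20):
            self.size, self.num_agents, self.horizon = size, num_agents, horizon
            self.goals = {(0, 0), (0, size - 1), (size - 1, 0), (size - 1, size - 1)}
            self.food = {g: food_per_goal for g in self.goals}   # 10 units per star
            center = size // 2
            self.pos = [(center, center)] * num_agents           # all agents start at the center
            self.fed = [False] * num_agents
            self.t = 0

        def observe(self, i):
            # One-hot encoding of agent i's (x, y) position.
            x, y = self.pos[i]
            obs = np.zeros(self.size * self.size)
            obs[x * self.size + y] = 1.0
            return obs

        def step(self, actions):
            # actions[i] in {0: up, 1: down, 2: left, 3: right, 4: stay}
            moves = [(0, 1), (0, -1), (-1, 0), (1, 0), (0, 0)]
            self.t += 1
            for i, a in enumerate(actions):
                dx, dy = moves[a]
                x = int(np.clip(self.pos[i][0] + dx, 0, self.size - 1))
                y = int(np.clip(self.pos[i][1] + dy, 0, self.size - 1))
                self.pos[i] = (x, y)
                if (x, y) in self.food and self.food[(x, y)] > 0 and not self.fed[i]:
                    self.food[(x, y)] -= 1        # each agent consumes one unit of food
                    self.fed[i] = True
            done = self.t >= self.horizon
            # Reward is 0 during the episode and 1 at the final step for fed agents.
            rewards = [1.0 if (done and self.fed[i]) else 0.0 for i in range(self.num_agents)]
            return rewards, done

    # Example rollout with random actions, just to exercise the sketch.
    env = MultiGoalGrid()
    rng = np.random.default_rng(0)
    done = False
    while not done:
        rewards, done = env.step(rng.integers(0, 5, size=env.num_agents))
    print(sum(rewards), "of", env.num_agents, "agents ended the episode fed")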

Visualization

We visualize two seeds each of ADAP, DIAYN*, and Vanilla, using both the concatenation and multiplicative models. Each agent trajectory is shown as a series of connected line segments representing the steps it took. The RGB color of each trajectory is directly proportional to its latent z.
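
As a concrete example of this coloring convention, something along the lines of the sketch below maps a latent to an RGB triple; the 3-dimensional latent and the [-1, 1] range are assumptions for illustration.

    import numpy as np

    # Sketch of the trajectory-coloring convention: rescale the latent into [0, 1] RGB.

    def latent_to_rgb(z, z_min=-1.0, z_max=1.0):
        """Map the first three latent components into [0, 1] RGB values."""
        z = np.asarray(z[:3], dtype=float)
        return tuple(np.clip((z - z_min) / (z_max - z_min), 0.0, 1.0).tolist())

    print(latent_to_rgb([0.5, -1.0, 1.0]))  # (0.75, 0.0, 1.0)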

Summary of Results

  • ADAP: Mult. model learned to reach all 4 goals consistently.
  • DIAYN*: Both models reached all four goals. Surprisingly, the concat. model was better at finding a consistently diverse policy space.
  • Vanilla: Mult. model generally failed to reach more than 2 goals, resulting in low overall rewards. The concatenation model fared even worse.
  • Vanilla (higher entropy): High-entropy policies do not imply diverse policies; this method generally reached only 1 goal.


Mult Model
Image
Image
Concat Model
Image
Image
Mult Model
Image
Image
Concat Model
Image
Image
Mult Model
Image
Image
Concat Model
Image
Image
Mult Model
Image
Image
Concat Model
Image
Image

CartPole


Here, we present qualitative results for CartPole under normal conditions and various ablations.
  • Normal: the standard CartPole conditions
  • Noisy Observations: random uniform noise added to the observations.
  • Reward for left and right sides: an ablated reward function, where the per-timestep reward is the negative or positive absolute value of the x-position, for the left and right ablations respectively.

Summary of Results

As in prior experiments, we optimize for the ablations entirely over the latent space distribution. ADAP, Vanilla, and DIAYN* all performed nearly perfectly under the Normal and Noisy conditions. However, only ADAP and DIAYN* were able to produce policies that explored the left and right halves of the x-axis.

(Best viewed in Safari or Firefox - Chrome seems to have issues with these GIFs)

ADAP

Image

Vanilla

Image

DIAYN*

Image

ADAP

Image

Vanilla

Image

DIAYN*

Image

ADAP

Image

Vanilla

Image

DIAYN*

Image