ADAP

Adaptable Agent Populations via a Generative Model of Policies

in Reinforcement Learning

Kenneth Derek
MIT CSAIL

Phillip Isola
MIT CSAIL

Overview



Image

Learn not one, but many, ways to succeed in an environment - while still using shared weights.




Image

By learning a diverse policy space in an open-ended environment with many solutions, we show that we can adapt to future ablations simply by selecting one of our many learned "species" - without changing parameter weights.


Image

We also learn a policy space in a competitive, two-player, zero-sum game. Here, no single deterministic policy is optimal against all adversaries. In contrast, our ADAP policy space adapts naturally to a wide array of both challenging and naive adversaries.



Overall, we hope to show how it can be beneficial in RL settings to optimize not just for reward, but also for diversity of behavior. As environments continue to increase in complexity and open-endedness -- filled with branching paths to success -- it makes sense to learn not just one, but many, solutions.





The Niche Specialization Experiment (Section 4)

Recall from the paper that in this experiment, we enforce that agents can attack only one of chickens or towers. Agents can initially choose either one, but once they make a choice, they are locked in. Crucially, agents do not observe which object type they are locked into. This means that a single-policy paradigm would result in either a "jack-of-all-trades" agent or an agent that only attacks one particular unit type. By instead learning a policy space, we can find individually specialized policies that, in aggregate, harness all resources.
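
As a rough illustration of this mechanic, here is a minimal sketch in Python (the class and method names are hypothetical, not the actual environment code): an agent's first attack fixes its unit type, and its observation never reveals that type.

    # Hypothetical sketch of the lock-in mechanic: the first attack fixes the
    # agent's unit type, and the observation never exposes which type that is.

    class LockInAgent:
        def __init__(self):
            self.locked_type = None  # "chickens" or "towers"; never shown to the policy

        def attack(self, target_type):
            if self.locked_type is None:
                self.locked_type = target_type       # the first attack locks the agent in
            # Attacking the other unit type has no effect (a "blunder").
            return target_type == self.locked_type

        def observation_extras(self):
            # Deliberately empty: the observation omits the locked type, so a
            # single shared policy cannot condition on which niche it occupies.
            return {}

    agent = LockInAgent()
    print(agent.attack("chickens"))  # True: the agent is now locked into chickens
    print(agent.attack("towers"))    # False: this attack is a blunder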

Visualization

Agents that are locked into towers are boxed in green, and agents that are locked into chickens are boxed in red. If a chicken-specialized agent attacks a tower, or vice versa, we flash the agent in purple and increment the blunder counter. A high number of blunders indicates that the agent population does not specialize well.
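
For concreteness, the blunder count used in these videos amounts to a small helper like the sketch below (illustrative only; names are hypothetical).

    # Illustrative sketch of the blunder metric: count attacks that target a
    # unit type other than the one the agent is locked into.

    def count_blunders(attack_events):
        """attack_events: iterable of (locked_type, attacked_type) pairs from a rollout."""
        return sum(1 for locked, attacked in attack_events
                   if locked is not None and attacked != locked)

    events = [("chickens", "chickens"), ("chickens", "towers"), ("towers", "towers")]
    print(count_blunders(events))  # 1 blunder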

We intend this as a qualitative demonstration of behavior, and are not too concerned with the exact numbers. Still, it is clear that ADAP keeps the number of blunders relatively low. Moreover, in ADAP only a small subset of agents blunder, indicating that it has learned a multi-modal policy space with specialized agents; these erroneous agents can be optimized out. In Vanilla and DIAYN*, all agents tend to blunder with roughly equal probability, and cannot be optimized out.

ADAP

Image

Vanilla

Image

DIAYN*

Image

Adapting to Ablations


Now that we have visualized how ADAP learns a generator with a specialized and multi-modal policy space, we can look at how to use this policy space to adapt to different environmental ablations.

These ADAP policies were not trained to solve these ablations specifically! It turns out that by learning a diverse policy space on the original, open-ended training environment, ADAP learned policy "species" that were implicitly well-suited to most ablations. For example, ADAP learned species that "prefer" to go to the bottom right, or that wait until their health is low before farming food.

Again, we are primarily concerned with qualitative results here; we quantify variance in the paper. We show results after latent-space optimization on each ablated environment.
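
Concretely, adaptation touches only the latent space: the generator's weights stay frozen, and we look for latents whose policies score well on the ablated environment. The sketch below uses a simple random search for illustration; `policy_generator` and `evaluate_return` are placeholders standing in for the trained generator G and an episode-return estimate, and the actual procedure optimizes the latent distribution rather than individual points.

    import numpy as np

    # Hedged sketch of adapting by latent search only, with generator weights frozen.

    def adapt_by_latent_search(policy_generator, evaluate_return, latent_dim=3,
                               num_candidates=64, episodes_per_candidate=5, seed=0):
        """Search for a latent whose policy scores well on the ablated environment."""
        rng = np.random.default_rng(seed)
        best_z, best_return = None, -np.inf
        for _ in range(num_candidates):
            z = rng.standard_normal(latent_dim)      # candidate "species" (latent_dim is illustrative)
            policy = policy_generator(z)             # no weight updates, only a new latent
            avg_return = np.mean([evaluate_return(policy)
                                  for _ in range(episodes_per_candidate)])
            if avg_return > best_return:
                best_z, best_return = z, avg_return
        return best_z, best_return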

ADAP

Image

Vanilla

Image

DIAYN*

Image

ADAP

Image

Vanilla

Image

DIAYN*

Image

ADAP

Image (reduced video quality for memory compression)

Vanilla

Image

DIAYN*

Image

ADAP

Image

Vanilla

Image

DIAYN*

Image

ADAP

Image

Vanilla

Image

DIAYN*

Image

Adapting to Adversaries


ADAP also exhibits robustness to a variety of adversaries in the zero-sum, two-player game of Markov Soccer. Below, we provide a summary of the different policies that ADAP emergently learned during its self-play curriculum.

Strategies for Markov Soccer

Before diving into specific episodes of Markov Soccer, here are some tactics that we have observed ADAP is generally able to find, without ever being explicitly trained against them. Hopefully, these help illustrate the types of policy diversity learned by ADAP, as well as explain ADAP's superior performance relative to the Vanilla baseline.

We indicate possession by shading with a light grey square.

Image

Oscillate 1: B can go up and around the bot A, rather than charging through A's defense.


Image

Stand: B simply needs to move deterministically towards the unguarded square. We've noticed that ADAP tends to be more committed, whereas the Vanilla baseline may attempt some fake-out moves. A more committed approach is better, since it reduces the chance that the game ends in a draw before B scores.


Image

Straight: A will deterministically move right, so if B can stand in front of A, possession will switch (indicated by the star) and B can sneak around A into the goal. In some Vanilla rollouts, B remains stuck behind A rather than sneaking out and around.


With that said, enjoy the sample rollouts below, collected from generators G learned with the ADAP (multiplicative) and Vanilla (multiplicative) methods. We hope they provide insight into the bot's behaviors as well as the behaviors of G. To view a new random episode, just click on the GIF.

Player A is the Bot, and Player B is a sample chosen from the learned policy space.
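
For reference, each rollout is generated roughly as in the sketch below: sample one latent, hold the resulting policy fixed for the whole episode, and play it against the scripted bot. `soccer_env`, `bot_policy`, and `G` are placeholder interfaces, not the actual implementation.

    import numpy as np

    # Sketch of a rollout: player A is a fixed scripted bot, player B is a single
    # policy sampled once from the generator G and held fixed for the episode.

    def soccer_rollout(soccer_env, bot_policy, G, latent_dim=3, seed=0):
        rng = np.random.default_rng(seed)
        z = rng.standard_normal(latent_dim)          # one draw from the latent prior
        policy_b = G(z)                              # one "species" from the policy space
        obs_a, obs_b = soccer_env.reset()
        done, return_b = False, 0.0
        while not done:
            action_a = bot_policy(obs_a)             # scripted bot A
            action_b = policy_b(obs_b)               # sampled policy B
            (obs_a, obs_b), reward_b, done = soccer_env.step(action_a, action_b)
            return_b += reward_b
        return return_b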

ADAP

Image
Sample 0

Vanilla

Image
Sample 0

ADAP

Image
Sample 0

Vanilla

Image
Sample 0

ADAP

Image
Sample 0

Vanilla

Image
Sample 0

ADAP

Image
Sample 0

Vanilla

Image
Sample 0

ADAP

Image
Sample 0

Vanilla

Image
Sample 0

ADAP

Image
Sample 0

Vanilla

Image
Sample 0

Toy Experiments


We couldn't help but include some qualitative results in common toy environment settings. Take a gander at the environments below.

Multi-Goal Evaluation


We present a discrete multi-goal environment: 40 agents start out in the center of the map (gold diamond), and must navigate to one of four gold stars at the corners. There is one small twist: each star only has 10 units of food, and agents consume 1 unit of food each.

  • Agent Observation: one-hot encoding of the agent's (x, y) position
  • Episode length: 20
  • Reward: 0 at timesteps 0-19; 1 at timestep 20 if the agent found food, 0 otherwise (a minimal sketch of this environment follows the list).
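
Below is a minimal sketch of such an environment. The grid size and the five-action move set are assumptions; only the food budget, shared start position, one-hot observation, episode length, and terminal reward follow the description above.

    import numpy as np

    # Sketch of the discrete multi-goal environment (grid size and action set assumed).

    class MultiGoalGrid:
        def __init__(self, size=11, num_agents=40, food_per_goal=10, horizon=20):
            self.size, self.num_agents, self.horizon = size, num_agents, horizon
            self.goals = {(0, 0), (0, size - 1), (size - 1, 0), (size - 1, size - 1)}
            self.food = {g: food_per_goal for g in self.goals}   # 10 units per star
            center = size // 2
            self.pos = [(center, center)] * num_agents           # all agents start at the center
            self.fed = [False] * num_agents
            self.t = 0

        def observe(self, i):
            # One-hot encoding of agent i's (x, y) position.
            x, y = self.pos[i]
            obs = np.zeros(self.size * self.size)
            obs[x * self.size + y] = 1.0
            return obs

        def step(self, actions):
            # actions[i] in {0: up, 1: down, 2: left, 3: right, 4: stay}
            moves = [(0, 1), (0, -1), (-1, 0), (1, 0), (0, 0)]
            self.t += 1
            for i, a in enumerate(actions):
                dx, dy = moves[a]
                x = int(np.clip(self.pos[i][0] + dx, 0, self.size - 1))
                y = int(np.clip(self.pos[i][1] + dy, 0, self.size - 1))
                self.pos[i] = (x, y)
                if (x, y) in self.food and self.food[(x, y)] > 0 and not self.fed[i]:
                    self.food[(x, y)] -= 1        # each agent consumes one unit of food
                    self.fed[i] = True
            done = self.t >= self.horizon
            # Reward is 0 during the episode and 1 at the final step for fed agents.
            rewards = [1.0 if (done and self.fed[i]) else 0.0 for i in range(self.num_agents)]
            return rewards, done

    # Example rollout with random actions, just to exercise the sketch.
    env = MultiGoalGrid()
    rng = np.random.default_rng(0)
    done = False
    while not done:
        rewards, done = env.step(rng.integers(0, 5, size=env.num_agents))
    print(sum(rewards), "of", env.num_agents, "agents ended the episode fed")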

Visualization

We visualize two seeds each of ADAP, DIAYN*, and Vanilla, using both the concatenation and multiplicative models. Each agent trajectory is shown as a series of connected line segments representing the steps it took. The RGB color of each trajectory is directly proportional to its latent z.
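
As a concrete example of this coloring convention, something along the lines of the sketch below maps a latent to an RGB triple; the 3-dimensional latent and the [-1, 1] range are assumptions for illustration.

    import numpy as np

    # Sketch of the trajectory-coloring convention: rescale the latent into [0, 1] RGB.

    def latent_to_rgb(z, z_min=-1.0, z_max=1.0):
        """Map the first three latent components into [0, 1] RGB values."""
        z = np.asarray(z[:3], dtype=float)
        return tuple(np.clip((z - z_min) / (z_max - z_min), 0.0, 1.0).tolist())

    print(latent_to_rgb([0.5, -1.0, 1.0]))  # (0.75, 0.0, 1.0)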

Summary of Results

  • ADAP: Mult. model learned to reach all 4 goals consistently.
  • DIAYN*: Both models reached all four goals. Surprisingly, the concat. model was better at finding a consistently diverse policy space.
  • Vanilla: Mult. model generally failed to reach more than 2 goals, resulting in low overall rewards. The concatenation model fared even worse.
  • Vanilla (higher entropy): High-entropy policies do not imply diverse policies; this method generally reached only 1 goal.


Mult Model
Image
Image
Concat Model
Image
Image
Mult Model
Image
Image
Concat Model
Image
Image
Mult Model
Image
Image
Concat Model
Image
Image
Mult Model
Image
Image
Concat Model
Image
Image

CartPole


Here, we present qualitative results for CartPole under normal conditions and various ablations.
  • Normal: the standard CartPole conditions
  • Noisy Observations: random uniform noise added to the observations.
  • Reward for left and right sides: an ablated reward function, where the per-timestep reward is the negative or positive absolute value of the x-position, for the left and right ablations respectively.

Summary of Results

As in prior experiments, we optimize for the ablations entirely over the latent space distribution. ADAP, Vanilla, and DIAYN* all performed nearly perfectly under the Normal and Noisy conditions. However, only ADAP and DIAYN* were able to produce policies that explored the left and right halves of the x-axis.

(Best viewed in Safari or Firefox - Chrome seems to have issues with these GIFs)

ADAP

Image

Vanilla

Image

DIAYN*

Image

ADAP

Image

Vanilla

Image

DIAYN*

Image

ADAP

Image

Vanilla

Image

DIAYN*

Image