
I'm using Ray's RLlib library to train a multi-agent Trainer on the 5-in-a-row game. This is a zero-sum environment, so I have a problem with degenerate agent behavior (the first agent always wins, in 5 moves). My idea is to alternate the agents' learning rates: first train the first agent while keeping the second one effectively random by setting its learning rate to zero. Once the first agent wins more than 90% of the games, switch the rates and repeat. But I can't change the learning rate after it has been set in the constructor. Is this possible?

def gen_policy(GENV, lr=0.001):
    # Returns the (policy_cls, obs_space, act_space, config) tuple that RLlib
    # expects for each entry in the multiagent "policies" dict; policy_cls=None
    # means the trainer's default policy class is used.
    config = {
        "model": {
            "custom_model": 'GomokuModel',
            "custom_options": {"use_symmetry": True, "reg_loss": 0},
        },
        "custom_action_dist": Categorical,
        "lr": lr
    }
    return (None, GENV.observation_space, GENV.action_space, config)

def map_fn(agent_id):
    if agent_id=='agent_0':
        return "policy_0"
    else:
        return "policy_1"

trainer = ray.rllib.agents.a3c.A3CTrainer(env="GomokuEnv", config={
        "multiagent": {
            "policies": {"policy_0": gen_policy(GENV, lr=0.001), "policy_1": gen_policy(GENV, lr=0)},
            "policy_mapping_fn": map_fn,
        },
        "callbacks": {"on_episode_end": clb_episode_end},
})


while True:
    rest = trainer.train()
    # here I want to change the learning rate of my policies based on environment statistics

I've tried to add these lines inside the `while True` loop:

new_config = trainer.get_config()
new_config["multiagent"]["policies"]["policy_0"]=gm.gen_policy(GENV, lr = 0.00321)
new_config["multiagent"]["policies"]["policy_1"]=gm.gen_policy(GENV, lr = 0.00175)

trainer["raw_user_config"]=new_config
trainer.config = new_config

It didn't help.

    Hi, this may not do exactly what you want but do have a look at `lr_schedule`. Usage will be something like: `config={"lr_schedule": [[0, 0.1], [400, 0.000001]]}` – Huan Sep 20 '19 at 18:38

2 Answers


I stumbled upon the same question and did some research on the RLlib implementation.

From the testing scripts, it looks like the lr_schedule is given as a list of endpoints like:

lr_schedule: [
            [0, 0.0005],
            [20000000, 0.000000000001],
        ]
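
In a multi-agent setup like the one in the question, the schedule could presumably also be supplied per policy, since per-policy config entries (such as "lr" in the question's gen_policy) are merged into each policy's config. A sketch of that idea, reusing the question's helper (GENV and Categorical come from the question; whether per-policy lr_schedule is honored is an assumption, not something I verified in the source):

def gen_policy(GENV, lr=0.001, lr_schedule=None):
    # Sketch: per-policy lr_schedule, extending gen_policy from the question.
    config = {
        "model": {
            "custom_model": 'GomokuModel',
            "custom_options": {"use_symmetry": True, "reg_loss": 0},
        },
        "custom_action_dist": Categorical,
        "lr": lr,
        "lr_schedule": lr_schedule,  # e.g. [[0, 0.0005], [20000000, 1e-12]]
    }
    return (None, GENV.observation_space, GENV.action_space, config)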

After that I checked out the implementation details.
In ray/rllib/policy/torch_policy.py the LearningRateSchedule mixin implements the entry point.
When an lr_schedule is defined, a PiecewiseSchedule is used.

The implementation of PiecewiseSchedule in ray/rllib/utils/schedules/piecewise_schedule.py documents the endpoints as follows:

endpoints (List[Tuple[int,float]]): A list of tuples
                `(t, value)` such that the output
                is an interpolation (given by the `interpolation` callable)
                between two values.
                E.g.
                t=400 and endpoints=[(0, 20.0),(500, 30.0)]
                output=20.0 + 0.8 * (30.0 - 20.0) = 28.0
                NOTE: All the values for time must be sorted in an increasing
                order.

That means each entry of the learning rate schedule consists of two values:
a timestep t (int) and the learning rate (float) to reach at that support point.

For each timestep in between those support points, an interpolation is used.
The interpolation can be specified via the interpolation parameter of PiecewiseSchedule, which defaults to _linear_interpolation:

interpolation (callable): A function that takes the left-value,
                the right-value and an alpha interpolation parameter
                (0.0=only left value, 1.0=only right value), which is the
                fraction of distance from left endpoint to right endpoint.
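
As a sanity check, here is a small standalone sketch (not RLlib code, it just reproduces the docstring's arithmetic with plain linear interpolation):

def linear_interpolation(left, right, alpha):
    # alpha=0.0 returns the left value, alpha=1.0 the right value
    return left + alpha * (right - left)

def schedule_value(t, endpoints):
    # Find the segment [(t_l, v_l), (t_r, v_r)] containing t and interpolate.
    for (t_l, v_l), (t_r, v_r) in zip(endpoints[:-1], endpoints[1:]):
        if t_l <= t < t_r:
            alpha = (t - t_l) / (t_r - t_l)
            return linear_interpolation(v_l, v_r, alpha)
    return endpoints[-1][1]  # past the last support point, hold the last value

print(schedule_value(400, [(0, 20.0), (500, 30.0)]))  # -> 28.0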

TL;DR

Therefore, the lr_schedule describes the support points of a piecewise (by default linear) interpolation of the learning rate over training timesteps.

Additionally, to change the parameter during training, according to this GitHub issue the best option seems to be to reinitialize the trainer:

state = trainer.save()   # checkpoint the current weights and optimizer state
trainer.stop()
# re-initialise the trainer with the updated config
trainer.restore(state)
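
Applied to the question's setup, a minimal sketch of that pattern could look like this (gen_policy, GENV, map_fn, clb_episode_end, the "GomokuEnv" registration and the new learning rates are taken from the question; this assumes the new config stays compatible with the saved checkpoint):

# Hedged sketch: rebuild the trainer with new per-policy learning rates and
# restore the previous weights from the checkpoint.
checkpoint = trainer.save()   # returns a checkpoint path
trainer.stop()                # shut down the old workers

trainer = ray.rllib.agents.a3c.A3CTrainer(env="GomokuEnv", config={
    "multiagent": {
        "policies": {
            "policy_0": gen_policy(GENV, lr=0.00321),
            "policy_1": gen_policy(GENV, lr=0.00175),
        },
        "policy_mapping_fn": map_fn,
    },
    "callbacks": {"on_episode_end": clb_episode_end},
})
trainer.restore(checkpoint)   # load the saved weights into the new trainer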

I found the simple examples here a little confusing, so I wanted to add a clear answer. So that other users don't have to dig into the code, I opened an issue and am adding my answer here as well: https://github.com/ray-project/ray/issues/15647

This is a tested example of a linearly decreasing learning rate up to a certain point.

lr_start = 2.5e-4
lr_end = 2.5e-5
lr_time = 50 * 1000000  # decay over the first 50M timesteps
config = {
    "lr": lr_start,
    "lr_schedule": [
        [0, lr_start],
        [lr_time, lr_end],
    ],
}

(Plot of the resulting learning rate over training timesteps, decreasing linearly from lr_start to lr_end.)
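
For reference, a minimal usage sketch under the assumption that any algorithm using the LearningRateSchedule mixin (A3C, PPO, ...) picks up this config; CartPole-v0 is only a stand-in environment here, and the exact layout of result["info"] can differ between RLlib versions:

import ray
from ray.rllib.agents.a3c import A3CTrainer

# Minimal sketch: plug the scheduled learning rate into a trainer.
ray.init(ignore_reinit_error=True)
trainer = A3CTrainer(env="CartPole-v0", config=config)
for _ in range(10):
    result = trainer.train()
    # The current learning rate is reported in the learner stats and should
    # decay towards lr_end as timesteps accumulate.
    print(result["info"]["learner"])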
