
I want to implement layer-wise learning rate decay while still using a scheduler. Specifically, what I currently have is:

import torch

model = Model()
optim = torch.optim.Adam(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optim, max_lr=0.1, total_steps=total_steps)  # total_steps = number of training steps

Then, the learning rate increases to 0.1 over the first 30% of the epochs and gradually decays afterwards. I want to combine this with layer-wise learning rate decay.

This tutorial is something that I want to implement, but it uses a fixed LR rather than an LR that changes with a scheduler. What I want is that, at every step, the model still uses the LR it gets from the optimizer, but every layer's LR is additionally decayed by a factor. It would go something like:

for i in range(steps):
    lr = scheduler.get_last_lr()[0]                 # current base LR from the scheduler (get_last_lr() returns a list)
    for idx, layer in enumerate(model.layers()):    # pseudocode: iterate over the layers in order
        layer['lr'] = lr * 0.9 ** (idx + 1)         # each successive layer gets a smaller LR
    output = model(input)
    ...

However, when using this, do I have to pass the model.parameters() to the optimizer again? How will the LR be computed in this scenario? Is there a better way to do this?

Also, I am looking for a way to do this for very large models, where listing every layer and specifying an LR for each one is rather tedious.

Minh-Long Luu

1 Answer


If you want to do something that's not a plain-vanilla, PyTorch-preimplemented schedule for your learning rate, I recommend forgoing the PyTorch scheduler class and manually adjusting the learning rates for each of the parameter groups yourself. You can access the learning rates directly, similar to your code above, but through the optimizer's parameter groups rather than the model layers:

for group in optim.param_groups:
    group["lr"] *= 0.9      # for example

From here, you can use either a list of decay factors or a dictionary keyed by the parameter-group names to keep this concise.
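To avoid listing every layer by hand on a very large model, one option (a sketch, not the only way) is to build one named parameter group per top-level child module and derive the decay factors from the iteration order; the group names and the 0.9 factor here are assumptions, chosen to mirror your question, and PyTorch keeps extra keys such as "name" in the group dict:

import torch

model = Model()                      # the model from the question

param_groups = []
decay_factors = {}                   # dictionary keyed by parameter-group name
for depth, (name, module) in enumerate(model.named_children()):
    params = [p for p in module.parameters() if p.requires_grad]
    if not params:
        continue
    param_groups.append({"params": params, "name": name})
    decay_factors[name] = 0.9 ** (depth + 1)   # deeper modules get smaller factors

optim = torch.optim.Adam(param_groups, lr=0.1)

for step in range(total_steps):
    base_lr = base_lr_at(step, total_steps)    # schedule from the sketch above
    for group in optim.param_groups:
        group["lr"] = base_lr * decay_factors[group["name"]]
    # forward / backward / optim.step() as usual

Since the groups are named, you can also override individual factors (e.g. give the head its own value) without touching the loop that builds the groups.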

DerekG