I want to implement layer-wise learning rate decay while still using a scheduler. Specifically, what I currently have is:
import torch

model = Model()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, total_steps=steps)  # steps = total number of training iterations
With this, the learning rate increases to 0.1 over the first 30% of the epochs and then gradually decays. I want to combine this with layer-wise learning rate decay.
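Just to confirm I understand the schedule itself, this is roughly how I checked its shape in isolation (the dummy parameter and total_steps=100 are just placeholders):

import torch

dummy = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.Adam(dummy, lr=0.1)
sched = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=0.1, total_steps=100)  # pct_start defaults to 0.3

lrs = []
for _ in range(100):
    opt.step()                         # optimizer first, then scheduler
    sched.step()
    lrs.append(sched.get_last_lr()[0])

# the LR rises for roughly the first 30 steps (30% of total_steps) and then anneals back down
print(lrs.index(max(lrs)), max(lrs))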
This tutorial is something I want to implement, but it uses a fixed LR rather than an LR that changes over time as it does with a scheduler. What I want is that at every step the model still uses the LR set by the scheduler, but each layer's LR is additionally decayed by a constant factor. It goes like:
for i in range(steps):
    lr = scheduler.get_last_lr()[0]                   # base LR set by the scheduler for this step
    for idx, layer in enumerate(model.layers()):      # pseudocode: iterate the layers in order
        layer['lr'] = lr * 0.9 ** (idx + 1)           # each successive layer gets a further-decayed LR
    output = model(input)
    ...
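To make that concrete, the closest runnable version I could come up with gives the optimizer one param group per layer and overwrites each group's LR after reading the scheduler (the toy model, decay_factor, dummy data and step count below are just placeholders, and I'm not sure this is the right approach):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 10), nn.Linear(10, 2))  # toy stand-in for my model
layers = [m for m in model.children() if any(True for _ in m.parameters())]                # skip parameter-less layers

decay_factor = 0.9
steps = 100

# one param group per layer so every layer can carry its own LR
optimizer = torch.optim.Adam([{"params": layer.parameters()} for layer in layers], lr=0.1)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, total_steps=steps)

x, y = torch.randn(8, 10), torch.randn(8, 2)
loss_fn = nn.MSELoss()

for step in range(steps):
    base_lr = scheduler.get_last_lr()[0]                    # LR the scheduler set for this step
    for idx, group in enumerate(optimizer.param_groups):
        group["lr"] = base_lr * decay_factor ** (idx + 1)   # layer-wise decay on top of the schedule
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()                                        # writes fresh per-group LRs for the next step

As far as I can tell the scheduler keeps its own per-group state, so overwriting group["lr"] between steps does not seem to break the schedule, but I would like confirmation.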
However, when using this, do I have to pass the model's parameters to the optimizer again (e.g. as per-layer param groups)? How will the LR be computed in this scenario? Is there a better way to do this?
Also, I am looking for a way to do this for very large models, where listing every layer and specifying an LR for each one by hand is rather tedious.
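One alternative I have been considering (not sure whether it is equivalent) is to build the param groups programmatically and let OneCycleLR carry the decay itself, since max_lr can be given as a list with one entry per param group (again, the toy model, decay_factor and step count are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 10), nn.Linear(10, 2))  # toy stand-in
decay_factor = 0.9
base_lr = 0.1
steps = 100

# build one param group per layer without listing them by hand
layers = [m for m in model.children() if any(True for _ in m.parameters())]
optimizer = torch.optim.Adam([{"params": layer.parameters()} for layer in layers], lr=base_lr)

# one max_lr per group, with the layer-wise decay baked in
max_lrs = [base_lr * decay_factor ** (idx + 1) for idx in range(len(layers))]
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=max_lrs, total_steps=steps)

for step in range(steps):
    ...               # forward and backward pass as usual
    optimizer.step()
    scheduler.step()  # each group then follows its own one-cycle curve, scaled per layer

Is something like this the intended way to combine a per-layer decay with OneCycleLR, or is the manual overwrite above preferable?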