I'm trying to train ArcFace, following a reference GitHub implementation.
As far as I know, ArcFace requires more than 200 training epochs on CASIA-WebFace with a large batch size.
Around 100 epochs into training, I stopped for a while because I needed the GPU for other tasks, and saved checkpoints of the model (ResNet) and the margin head. Before it was stopped, the loss was between 0.3 and 1.0, and training accuracy had climbed to 80-95%.
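For context, the saving step looks roughly like this (a minimal sketch; `backbone`, `margin`, the torchvision ResNet stand-in, and the file names are my placeholders, not the reference repo's actual code):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Placeholder networks standing in for my actual ones: a ResNet backbone
# emitting 512-d embeddings and an ArcFace margin head (a plain Linear
# layer here, just to keep the sketch self-contained).
backbone = resnet18(num_classes=512)
margin = nn.Linear(512, 10575)  # CASIA-WebFace has 10,575 identities

# Only the two network state dicts are saved -- no optimizer state.
torch.save(backbone.state_dict(), "backbone_epoch100.pth")
torch.save(margin.state_dict(), "margin_epoch100.pth")
```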
When I resumed ArcFace training by loading the checkpoint files with load_state_dict, the first batch looked normal, but then the loss increased sharply and the accuracy dropped very low. How did this happen? Having no other option, I continued training anyway, but the loss does not seem to be decreasing well, even though the model had already been trained for over 100 epochs...
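The resume step is roughly this (same placeholder names as above; note that only the weights are restored, so a freshly built optimizer starts from scratch):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(num_classes=512)
margin = nn.Linear(512, 10575)

# Restore only the network weights.
backbone.load_state_dict(torch.load("backbone_epoch100.pth"))
margin.load_state_dict(torch.load("margin_epoch100.pth"))

# The optimizer is rebuilt from scratch: its momentum buffers are empty and
# the learning rate restarts at the initial value instead of the decayed one
# (the hyperparameters here are illustrative, not the repo's).
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(margin.parameters()),
    lr=0.1, momentum=0.9, weight_decay=5e-4,
)
```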
When I searched for similar issues, people said the problem was that the optimizer state was not saved. (The reference GitHub page didn't save the optimizer, so neither did I.) Is that true?
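If that is the cause, I assume the fix would look something like the sketch below: bundle the optimizer (and LR scheduler) state dicts into one checkpoint and restore them on resume. All names and hyperparameters are again placeholders, not the reference repo's code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Stand-ins for the real networks and training setup.
backbone = resnet18(num_classes=512)
margin = nn.Linear(512, 10575)
params = list(backbone.parameters()) + list(margin.parameters())
optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150])

# --- at save time: bundle everything needed for an exact resume ---
torch.save({
    "epoch": 100,
    "backbone": backbone.state_dict(),
    "margin": margin.state_dict(),
    "optimizer": optimizer.state_dict(),  # momentum buffers live here
    "scheduler": scheduler.state_dict(),  # current (decayed) learning rate
}, "full_checkpoint.pth")

# --- at resume time: restore all four pieces, not just the weights ---
ckpt = torch.load("full_checkpoint.pth")
backbone.load_state_dict(ckpt["backbone"])
margin.load_state_dict(ckpt["margin"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])
start_epoch = ckpt["epoch"] + 1
```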