Has anyone trained a PyTorch LSTM model, saved it, reloaded it somewhere else, and then continued training? I've been trying to do this for the past 2 weeks with no good results (I'm tracking the training loss). Every time I reload the model and resume training, it behaves erratically: the training losses are very different from the ones I get when I train the same model continuously, without saving and reloading.

I have tried torch.save() on the whole model, saving and loading the state_dict, Python's native pickle mechanism, and joblib, but all of them show the same issue. I even saved and reloaded the optimizer state, without much luck.
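For reference, a minimal sketch of the state_dict-based checkpointing described above, saving and restoring both the model and the optimizer. The `TinyLSTMAutoencoder` class, its sizes, the random batches, and the file name are hypothetical placeholders, not the actual model or data from this question. It compares the loss on the next batch for a continuously trained model against a freshly constructed model restored from the checkpoint.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyLSTMAutoencoder(nn.Module):
    """Hypothetical stand-in for the real autoencoder."""

    def __init__(self, n_features=4, hidden_size=8):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, n_features, batch_first=True)

    def forward(self, x):
        encoded, _ = self.encoder(x)        # hidden state starts at zeros
        decoded, _ = self.decoder(encoded)  # same here
        return decoded


def train_step(model, optimizer, batch):
    """One reconstruction-loss update; returns the loss before the update."""
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(batch), batch)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = TinyLSTMAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch_1 = torch.randn(16, 10, 4)  # (batch, sequence, features), made-up data
batch_2 = torch.randn(16, 10, 4)

train_step(model, optimizer, batch_1)

# Save model weights and optimizer state together in one checkpoint.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    path,
)

# "Somewhere else": build fresh objects, then restore their state.
resumed_model = TinyLSTMAutoencoder()
resumed_optimizer = torch.optim.Adam(resumed_model.parameters(), lr=1e-3)
checkpoint = torch.load(path)
resumed_model.load_state_dict(checkpoint["model_state_dict"])
resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

# Train both models on the same next batch and compare the losses.
loss_continuous = train_step(model, optimizer, batch_2)
loss_resumed = train_step(resumed_model, resumed_optimizer, batch_2)
print(loss_continuous, loss_resumed)
```

In this toy setup (CPU, no dropout, identical next batch) the two losses should agree, so a large gap in a real pipeline would point at something outside the checkpoint itself, such as different data, shuffling, or preprocessing between runs.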

Could it somehow be related to the hidden and cell states of the LSTM layers? Should I save and reload them as well every time I want to train? Or could it be something else entirely?
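To make the hidden-state question concrete, here is a small sketch (layer sizes and input are made up for illustration). It shows that the state_dict of an `nn.LSTM` holds only weights and biases, that calling the layer without an initial state is equivalent to passing zeros for `h_0` and `c_0`, and what carrying the final state over to the next call would look like if the training loop is meant to be stateful.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
x = torch.randn(2, 5, 4)  # (batch, sequence, features)

# The state_dict contains parameters only, no hidden or cell state.
state_keys = sorted(lstm.state_dict().keys())
print(state_keys)

# Without an explicit initial state, h_0 and c_0 default to zeros.
out_default, (h_n, c_n) = lstm(x)
zeros = torch.zeros(1, 2, 8)  # (num_layers, batch, hidden_size)
out_zeros, _ = lstm(x, (zeros, zeros))
same_as_zero_init = torch.allclose(out_default, out_zeros)

# A stateful loop has to pass the previous final state in explicitly
# (detached, so gradients do not flow back into earlier batches).
carried = (h_n.detach(), c_n.detach())
out_carried, _ = lstm(x, carried)
differs_when_carried = not torch.allclose(out_default, out_carried)
```

So hidden and cell states only need to be saved alongside the checkpoint if the training code passes them from one batch to the next; if the layers are always called without an initial state, every forward pass starts from zeros both before and after reloading.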

I have searched extensively for this issue but to no avail. Any help would be much appreciated.

Some more info: I'm trying to detect anomalies in data using an autoencoder, so each time I reload the model it is trained on only one batch of data and then saved again to be reused for the next batch.

  • Maybe [this](https://stackoverflow.com/a/43819235/2996989) answer is helpful. – Ahmed Sunny Dec 19 '19 at 14:48
  • Have you tried the official tutorial on resuming the training of a model with PyTorch? https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-a-general-checkpoint-for-inference-and-or-resuming-training You need to save the optimizer, loss, epoch, etc., along with the model state dictionary. – Eskapp Dec 19 '19 at 15:02
  • @AhmedSunny I have already tried that, but to no avail. My intuition is that, because I want the training of a loaded model to continue from the previous state, I need to somehow save and load the hidden states of the model as well at each training step, since the hidden states are reinitialized at the start of each epoch. – Akshay Bhardwaj Dec 20 '19 at 05:18
  • @Eskapp Sadly, I've tried all of the above, but to no avail. – Akshay Bhardwaj Dec 20 '19 at 05:19

0 Answers