You can use the `Trainer` class from transformers to train the model. The `Trainer` also needs you to specify the `TrainingArguments`, which allow you to save checkpoints of the model during training.
Some of the parameters you set when creating `TrainingArguments` are:

- `save_strategy`: The checkpoint save strategy to adopt during training. Possible values are:
  - `"no"`: no saves are done during training.
  - `"epoch"`: a save is done at the end of each epoch.
  - `"steps"`: a save is done every `save_steps`.
- `save_steps`: Number of update steps between two checkpoint saves when `save_strategy="steps"`.
- `save_total_limit`: If a value is passed, limits the total number of checkpoints; older checkpoints in `output_dir` are deleted.
- `load_best_model_at_end`: Whether or not to load the best model found during training at the end of training.
One important thing about `load_best_model_at_end` is that when it is set to `True`, `save_strategy` needs to be the same as `eval_strategy`, and if it is `"steps"`, `save_steps` must be a round multiple of `eval_steps`.