
I have a basic question about model weights and checkpoints.

When training a model, each layer in the model graph launches kernels that execute on the GPU. The weights stay in GPU memory for the forward and backward passes. Once the weights are updated during the backward pass, where are they stored? Are they moved back to CPU memory? When does that move happen?

When checkpointing is done, do we get the weights from CPU memory?

Can someone explain the whole execution flow ?

2 Answers


In most cases, the updated weights from the backward pass remain in GPU memory. They are stored as floating-point tensors, which enables fast matrix operations during training. The weights are updated in place on each iteration of the training loop and stay on the GPU until training ends. When a checkpoint is taken, the weights are serialized to disk (local or remote storage); when they are needed again, they are typically loaded into CPU memory first and then moved back to the GPU. This is the general flow, but it can vary with the framework and hardware.
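A minimal sketch of that flow in PyTorch (assuming a toy `nn.Linear` model and a local file `checkpoint.pt`; on a real run the model would live on the GPU and the tensors would be copied to CPU before saving):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # toy model; in real training this would be .cuda()

# Copy each tensor to CPU before saving, so the checkpoint file
# is device-agnostic and can be restored on any machine
cpu_state = {k: v.cpu() for k, v in model.state_dict().items()}
torch.save(cpu_state, "checkpoint.pt")

# Later: load from disk (into CPU memory) and restore into the model
state = torch.load("checkpoint.pt")
model.load_state_dict(state)
```

After `load_state_dict`, moving the model back to the GPU is a single `model.cuda()` (or `model.to(device)`) call.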


The weights stay on the GPU unless they are explicitly moved somewhere else.
When you save a checkpoint, the weights are serialized to disk using pickle without first being moved to the CPU. That's why, if you pickle a model's state_dict that lives on the GPU and then try to load it on a system without a GPU, loading will fail unless you remap the storages.
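The usual fix is `torch.load`'s `map_location` argument, which remaps the saved storages onto an available device. A small sketch (the tensor here is on CPU for illustration; on a real GPU run it would be a CUDA tensor, and `map_location="cpu"` is what lets a CPU-only machine load it):

```python
import torch

t = torch.randn(3)           # on a real run: t = torch.randn(3, device="cuda")
torch.save(t, "tensor.pt")

# map_location tells torch.load where to place the deserialized storages,
# so a checkpoint saved from GPU tensors still loads on a CPU-only box
loaded = torch.load("tensor.pt", map_location="cpu")
```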

Also note that pickle has to copy the data it dumps into system RAM and do its processing there, but it does not mutate the object's underlying attributes while doing so. Your model's weights are therefore stored in their original form, with all attributes intact.
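This is easy to verify with plain pickle and a hypothetical `Weights` class standing in for a tensor (the class and its `device` attribute are illustrative, not a PyTorch API):

```python
import pickle

class Weights:
    def __init__(self):
        self.data = [1.0, 2.0]
        self.device = "cuda:0"  # attribute pickle records as-is, never mutates

w = Weights()
blob = pickle.dumps(w)        # serializes a byte-level copy of the object
restored = pickle.loads(blob)
```

Both the original and the restored object keep `device == "cuda:0"`, which is exactly why a GPU-saved state_dict "remembers" its device when unpickled elsewhere.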

Hossein