I am fine-tuning an LLM. I am using a GPU with 15 GB of memory, but when PyTorch saves a checkpoint, an OOM exception is raised. The full exception stack is:
Can I move the parameters from GPU memory to CPU memory and then save the checkpoint?
The image you posted shows that you're actually using TensorFlow, not PyTorch.
My understanding is that you had no memory issues during training, but an OOM occurred while saving the checkpoint, and you want to avoid it by moving the model to the CPU at that point. This can be achieved by pinning the operations to the CPU device:
import tensorflow as tf

with tf.device('/cpu:0'):
    # tf.Session is the TensorFlow 1.x API; ops created in this
    # scope are placed on the CPU instead of the GPU.
    with tf.Session() as sess:
        # your code here
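
If you are in fact on PyTorch as the question text says, here is a minimal sketch of the same idea, assuming `model` is your fine-tuned module living on the GPU and `checkpoint.pt` is a placeholder path:

import torch

# Copy every parameter tensor to host (CPU) memory first, so that
# serialization does not allocate anything extra on the GPU.
cpu_state_dict = {k: v.cpu() for k, v in model.state_dict().items()}
torch.save(cpu_state_dict, "checkpoint.pt")

The `.cpu()` call makes a host-side copy of each tensor, so the GPU copies stay untouched and training can continue from them after the checkpoint is written.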