
I am fine-tuning an LLM. I use a GPU with 15 GB of memory, but when PyTorch saves a checkpoint, an out-of-memory (OOM) exception occurs. The full exception stack is:

*(Stack trace posted as an image; not transcribed.)*

Can I move the parameters from GPU memory to CPU memory and then save the checkpoint?

Peter Mortensen
王泽君
  • Please review *[Why not upload images of code/errors when asking a question?](https://meta.stackoverflow.com/questions/285551/)* (e.g., *"Images should only be used to illustrate problems that* ***can't be made clear in any other way,*** *such as to provide screenshots of a user interface."*) and [do the right thing](https://stackoverflow.com/posts/76320113/edit). Thanks in advance. – Peter Mortensen May 24 '23 at 11:19

1 Answer


Your screenshot shows that you're actually using TensorFlow, not PyTorch.

My understanding is that you had no memory issues during training, but an OOM occurred while saving the checkpoint, and you want to avoid this by placing the model on the CPU at that point. In the TensorFlow 1.x API, this can be done with a device scope:

# Pin operations created inside this scope to the CPU (TensorFlow 1.x API)
with tf.device('/cpu:0'):
  with tf.Session() as sess:
    # Build your tf.train.Saver and run sess.run(...) / saver.save(...) here
    # your code here
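Since the question itself mentions PyTorch, here is the analogous idea there: a minimal sketch (using a hypothetical small `nn.Linear` stand-in for the fine-tuned model) that copies the state dict to CPU before calling `torch.save`, so serialization does not touch GPU memory:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the fine-tuned LLM
model = nn.Linear(8, 8)

# Copy every parameter tensor to CPU before serializing, so saving the
# checkpoint does not require additional GPU memory
cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}

path = os.path.join(tempfile.gettempdir(), "checkpoint.pt")
torch.save(cpu_state, path)
```

To resume training on the GPU later, load the file and call `model.load_state_dict(...)`, optionally with `map_location` in `torch.load` to control placement.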
Well Honey