How to avoid "CUDA out of memory" in PyTorch

Question

I think it's a pretty common message for PyTorch users with low GPU memory:

RuntimeError: CUDA out of memory. Tried to allocate X MiB (GPU X; X GiB total capacity; X GiB already allocated; X MiB free; X cached)

I tried to process an image by loading each layer to GPU and then loading it back:

for m in self.children():
    m.cuda()
    x = m(x)
    m.cpu()
    torch.cuda.empty_cache()

But it doesn't seem to be very effective. Are there any tips and tricks for training large deep learning models while using little GPU memory?

What's up with the smileys? lol.. Also, decrease your batch size and/or train on smaller images. Look at the Apex library for mixed precision training. Finally, when decreasing the batch size to, for example, 1 you might want to hold off on setting the gradients to zero after every iteration, since it's only based on a single image. — sansa, Dec 01 '19 at 21:02
I had the same problem using Kaggle. It worked fine with batches of 64 and then once I tried 128 and got the error nothing worked. Even the batches of 64 gave me the same error. Tried resetting a few times. `torch.cuda.empty_cache()` did not work. Instead first disable the GPU, then restart the kernel, and reactivate the GPU. This worked for me. — multitudes, Jul 01 '20 at 16:43
Reduce the batch size of the data being fed to your model. Worked for me — patrickpato, Feb 27 '21 at 03:10
This is one of [Frequently Asked Questions](https://pytorch.org/docs/stable/notes/faq.html) of PyTorch, you can read through the guide to help locate the problem. — Ynjxsjmh, Apr 21 '22 at 12:33
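
The first comment's suggestion to hold off on zeroing the gradients is usually called gradient accumulation. A minimal sketch, assuming a toy model and random data (the names `micro_batches` and `accum_steps` are made up for illustration): several tiny batches contribute to one optimizer step, so only one tiny batch is in memory at a time.

```python
import torch
from torch import nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical toy model and data; substitute your own.
model = nn.Linear(8, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
micro_batches = [(torch.randn(1, 8), torch.randn(1, 1)) for _ in range(8)]

accum_steps = 4      # assumption: 4 micro-batches of size 1 act like one batch of 4
optimizer_steps = 0

optimizer.zero_grad()
for i, (x, y) in enumerate(micro_batches, start=1):
    x, y = x.to(device), y.to(device)
    # Divide so the accumulated gradient is the mean over the effective batch.
    loss = criterion(model(x), y) / accum_steps
    loss.backward()  # gradients add up in .grad until zero_grad() is called
    if i % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

print(optimizer_steps)
```

With 8 micro-batches and `accum_steps = 4`, the optimizer steps twice.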

score 108 · Accepted Answer · edited Mar 30 '22 at 04:45

Although

import torch
torch.cuda.empty_cache()

releases cached GPU memory that PyTorch is no longer using, and unused variables can be removed manually with

import gc
del variables
gc.collect()

the error may still come back after running these commands. `del` only removes a reference to the memory occupied by a variable; the memory itself is freed once no references remain, and `torch.cuda.empty_cache()` cannot release memory that live tensors still occupy. So restarting the kernel, reducing batch_size, and finding the largest batch_size that fits is the best possible option (but sometimes not a very feasible one).

Another way to get deeper insight into the allocation of memory on the GPU is to use:

torch.cuda.memory_summary(device=None, abbreviated=False)

Both arguments are optional. This gives a readable summary of memory allocation and can help you figure out why CUDA is running out of memory, so that after restarting the kernel you can keep the error from happening again (as I did in my case).

Passing the data iteratively might help, but reducing the size of your network's layers or breaking them down can also be effective, since the model itself sometimes occupies significant memory (for example, during transfer learning).
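
One way to act on the summary is to log memory between steps and see which step makes the numbers jump. A rough sketch, assuming a helper I named `log_cuda_memory` (not a PyTorch function); it uses `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` and falls back to zeros on a machine without CUDA.

```python
import torch


def log_cuda_memory(tag):
    """Print and return (allocated, reserved) in MiB; (0.0, 0.0) without CUDA."""
    if not torch.cuda.is_available():
        print(f"[{tag}] CUDA not available")
        return 0.0, 0.0
    allocated = torch.cuda.memory_allocated() / 2**20  # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / 2**20    # memory held by the caching allocator
    print(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")
    return allocated, reserved


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

log_cuda_memory("start")
x = torch.randn(1024, 1024, device=device)  # about 4 MiB of float32
log_cuda_memory("after allocating x")
del x                                       # drops the last reference to the tensor
log_cuda_memory("after del x")
if torch.cuda.is_available():
    torch.cuda.empty_cache()                # hands unused cached blocks back to the driver
log_cuda_memory("after empty_cache")
```

On a GPU you would typically expect "allocated" to drop after `del x`, while "reserved" only drops after `empty_cache()`.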

edited Mar 30 '22 at 04:45

Mateen Ulhaq

answered Jun 24 '20 at 13:48

SHAGUN SHARMA

`This gives a readable summary of memory allocation and allows you to figure the reason of CUDA running out of memory`. I printed out the results of the `torch.cuda.memory_summary()` call, but there doesn't seem to be anything informative that would lead to a fix. I see rows for `Allocated memory`, `Active memory`, `GPU reserved memory`, etc. What should I be looking at, and how should I take action? – stackoverflowuser2010 Sep 18 '20 at 00:54
I have a small laptop with MX130 and 16GB ram. Suitable batchsize was 4. – Gayan Kavirathne Oct 15 '20 at 15:47
@stackoverflowuser2010 You should be printing it out between function calls to see which one causes the most memory increase – JobHunter69 May 05 '21 at 17:27
do `print(torch.cuda.memory_summary(device=None, abbreviated=False))` to get the info in a prettified manner – Elvin Aghammadzada Oct 25 '22 at 16:29

score 53 · Answer 2 · answered Oct 13 '20 at 02:27

Just reduce the batch size, and it will work. While I was training, it gave the following error:

CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 4.29 GiB already allocated; 10.12 MiB free; 4.46 GiB reserved in total by PyTorch)

I was using a batch size of 32. I changed it to 15 and it worked for me.
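
If you don't want to guess the batch size by hand, the search can be automated by halving on failure. This is only a sketch under assumptions: `find_batch_size`, `is_oom_error` and `fake_step` are made-up names, and the check relies on the OOM error being a `RuntimeError` whose message contains "out of memory" (as far as I know, newer PyTorch versions raise `torch.cuda.OutOfMemoryError`, which is a `RuntimeError` subclass).

```python
def is_oom_error(exc):
    """Heuristic check for a CUDA out-of-memory error based on its message."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc)


def find_batch_size(try_batch, start=64, minimum=1):
    """Halve the batch size until try_batch(batch_size) stops raising an OOM error."""
    batch_size = start
    while batch_size >= minimum:
        try:
            try_batch(batch_size)
            return batch_size
        except RuntimeError as exc:
            if not is_oom_error(exc):
                raise  # unrelated error: do not hide it
            # With a real model, call torch.cuda.empty_cache() here before retrying.
            batch_size //= 2
    raise RuntimeError("even the minimum batch size does not fit")


# Stand-in for one real training step: pretends anything above 20 samples is too big.
def fake_step(batch_size):
    if batch_size > 20:
        raise RuntimeError("CUDA out of memory. Tried to allocate ...")


print(find_batch_size(fake_step, start=64))
```

As the comments under the question note, in a notebook a failed attempt can leave memory occupied until the kernel is restarted, so the search may need to run in a fresh process.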

answered Oct 13 '20 at 02:27

Rahul

This doesn't always work. I lowered the batch size from 16 to 2, but still "out of memory". – Danijel May 03 '23 at 12:46

score 29 · Answer 3 · edited Apr 17 '23 at 14:38

Send the batches to CUDA iteratively, and make small batch sizes. Don't send all your data to CUDA at once in the beginning. Rather, do it as follows:

for e in range(epochs):
    for images, labels in train_loader:
        if torch.cuda.is_available():
            # Move only the current batch to the GPU
            images, labels = images.cuda(), labels.cuda()
        # forward pass, loss, backward pass and optimizer step go here
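
For completeness, a self-contained version of this loop might look like the sketch below. The dataset, model and hyperparameters are stand-ins I made up; the point is only that the dataset stays on the CPU and each batch is moved to the device right before it is used.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in data: 64 RGB images of 8x8 pixels, 10 classes, kept on the CPU.
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))
train_loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

epochs = 2
batches_seen = 0
for e in range(epochs):
    for images, labels in train_loader:
        # Only the current batch lives on the GPU.
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        batches_seen += 1

print(batches_seen)
```

Using `.to(device)` instead of `.cuda()` keeps the same code working on a machine without a GPU.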

edited Apr 17 '23 at 14:38

answered Dec 01 '19 at 20:55

Nicolas Gervais

I get this error message inside a jupyter notebook if I run a cell that starts training more than once. Restarting the kernel fixes this, but it would be nice if we could clear the cache somehow... For instance, `torch.cuda.empty_cache()` doesn't help as of now. Even though it probably should... :( – David Jun 11 '20 at 21:56

score 12 · Answer 4 · answered Oct 28 '20 at 19:53

Try not to drag your grads too far.

I got the same error when I tried to sum up the loss over all batches:

loss = self.criterion(pred, label)

total_loss += loss

Here total_loss becomes a tensor that requires grad, so the computation graph of every batch stays in memory. Using loss.item() instead of loss solved the problem:

loss = self.criterion(pred, label)

total_loss += loss.item()
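
A small sketch of the difference, using a toy model I made up: summing the loss tensors produces a tensor that still requires grad (and so keeps every batch's graph reachable), while summing `loss.item()` produces a plain Python float.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
criterion = nn.MSELoss()

graph_total = 0.0  # accumulates tensors: each addition extends the autograd graph
float_total = 0.0  # accumulates plain Python floats

for _ in range(3):
    pred = model(torch.randn(2, 4))
    loss = criterion(pred, torch.zeros(2, 1))
    graph_total = graph_total + loss   # keeps a reference to every batch's graph
    float_total += loss.item()         # copies the value out; the graph can be freed

print(type(graph_total).__name__, graph_total.requires_grad)
print(type(float_total).__name__)
```

Both totals hold the same number; only the second one is safe to keep around for logging.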

The solution below is credited to yuval reina in the Kaggle question:

This error is related to GPU memory, not general memory, so @cjinny's comment might not work.
Do you use TensorFlow/Keras or PyTorch?
Try using a smaller batch size.
If you use Keras, try decreasing some of the hidden layer sizes.
If you use PyTorch:
    Do you keep all the training data on the GPU all the time?
    Make sure you don't drag the grads too far.
    Check the sizes of your hidden layers.

score 8 · Answer 5 · answered Oct 21 '21 at 13:33

Most things are covered, but I will add a little.

If torch gives an error such as "Tried to allocate 2 MiB", the message is misleading. That figure is only the size of the one allocation that failed; the real problem is that CUDA has run out of the total memory required to train the model. You can reduce the batch size. If even a batch size of 1 does not work (this happens when you train NLP models with massive sequences), try passing less data; this will help you confirm that your GPU does not have enough memory to train the model.

Also, garbage collection and cache clearing have to be done again if you want to re-train the model.

answered Oct 21 '21 at 13:33

YoungSheldon

I was training NLP model and had batch size of 2. Changed to 1 and it worked. – Wojciech Jakubas Feb 20 '22 at 09:53
I trained BERT and RoBERTa and solved it by decreasing the context word window. – TuringTux Jun 27 '22 at 12:41

score 4 · Answer 6 · edited Jul 09 '21 at 11:01

Follow these steps:

Reduce the train, val and test data
Reduce the batch size (e.g. 16 or 32)
Reduce the number of model parameters (e.g. to less than a million)

In my case, the same error was raised when I trained on the Common Voice dataset in Kaggle kernels. I dealt with it by reducing the training dataset to 20000, the batch size to 16 and the model parameters to 112K.
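
To see how many parameters a model actually has, something like the following sketch can help. The helper names `count_parameters` and `parameter_mib` are mine, and the small model is only an example. Keep in mind that parameters are just part of the picture: gradients, optimizer state and activations also take GPU memory.

```python
import torch
from torch import nn


def count_parameters(model):
    """Number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def parameter_mib(model):
    """Memory taken by the parameters alone, in MiB."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 2**20


# Example model: (100*50 + 50) + (50*10 + 10) = 5560 parameters.
model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))

print(count_parameters(model))
print(f"{parameter_mib(model):.4f} MiB")
```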

edited Jul 09 '21 at 11:01

Dharman

answered Jul 09 '21 at 10:55

Kavi Arasan

score 4 · Answer 7 · answered Jul 27 '22 at 04:38

If you are done training and just want to test with an image, make sure to wrap the forward pass in with torch.no_grad() and call m.eval() on each module before running it:

with torch.no_grad():
  for m in self.children():
    m.cuda()
    m.eval()
    x = m(x)
    m.cpu()
    torch.cuda.empty_cache()

This may seem obvious, but it worked in my case. I was trying to use BERT to transform sentences into an embedding representation. As BERT is a pre-trained model, I didn't need to save all the gradients, and they were consuming all of the GPU's memory.
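
For a whole model (rather than layer by layer), the same idea can be sketched as below. The model here is a made-up stand-in. Note that it is `torch.no_grad()` that stops PyTorch from recording the graph, while `eval()` switches layers such as dropout to inference behaviour.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for a trained model.
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 4)
).to(device)
model.eval()                       # dropout is disabled, batch norm would use running stats

x = torch.randn(8, 16, device=device)
with torch.no_grad():              # no graph is recorded, so activations are not kept for backward
    out = model(x)

print(out.requires_grad, out.grad_fn)
```

The output tensor has no `grad_fn`, so nothing from the forward pass is held for a backward pass that will never happen.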

score 2 · Answer 8 · edited Jan 23 '21 at 07:32

There are ways to avoid it, but it certainly depends on your GPU memory size:

Load the data onto the GPU while unpacking it iteratively:

for features, labels in batch:
    features, labels = features.to(device), labels.to(device)

Use FP16 (half precision) or FP32 (single precision) float dtypes rather than anything wider.
Try reducing the batch size if you run out of memory.
Use the .detach() method on tensors you keep around but no longer need gradients for, so they stop holding on to the computation graph. Note that .detach() does not by itself remove a tensor from the GPU.

If all of the above are used properly, the PyTorch library is already highly optimized and efficient.

score 1 · Answer 9 · answered Jan 29 '23 at 19:19

If you are working with images, just reduce the input image shape. For example, if you are using 512x512, try 256x256. It worked for me!
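
The saving is easy to estimate: halving each side of the image cuts the number of pixels (and roughly the activation memory of a convolutional network) by four. A quick sketch with a helper I named `tensor_mib`, assuming float32 batches of 8 RGB images:

```python
import torch


def tensor_mib(t):
    """Size of a tensor's data in MiB."""
    return t.nelement() * t.element_size() / 2**20


big = torch.empty(8, 3, 512, 512)    # batch of 8 RGB images, 512x512, float32
small = torch.empty(8, 3, 256, 256)  # same batch at 256x256

print(tensor_mib(big), tensor_mib(small), tensor_mib(big) / tensor_mib(small))
```

This prints 24.0, 6.0 and 4.0: the input batch alone shrinks from 24 MiB to 6 MiB.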

answered Jan 29 '23 at 19:19

Ashwini Kumar Upadhyay

score 1 · Answer 10 · answered Mar 15 '23 at 23:36

Might seem too simplistic but it worked for me; I just closed my VScode and opened it again and then restarted and ran all the cells.

answered Mar 15 '23 at 23:36

fleurderose

score 0 · Answer 11 · answered Oct 13 '20 at 05:57

Implementation:

Feed the images to the GPU batch by batch.
Use a small batch size during training or inference.
Resize the input images to a smaller size.

Technically:

Most networks are over-parameterized, which means they are larger than the learning task requires. So finding an appropriate network structure can help:

a. Compact your network with techniques like model compression, network pruning and quantization.

b. Directly use a more compact network structure such as MobileNetV1/V2/V3.

c. Neural architecture search (NAS).

score 0 · Answer 12 · edited Dec 29 '20 at 19:36

I had the same error and fixed it by resizing my images from ~600 to 100 using the lines:

import torchvision.transforms as transforms
transform = transforms.Compose([
    transforms.Resize((100, 100)), 
    transforms.ToTensor()
])

edited Dec 29 '20 at 19:36

Samuel Prevost

answered Dec 29 '20 at 19:00

Ramy Abdallah

score 0 · Answer 13 · answered Jul 26 '21 at 17:10

Although this seems bizarre, what I found is that many sessions keep running in the background in Colab even if we factory reset the runtime or close the tab. I fixed this by clicking "Runtime" in the menu and then selecting "Manage Sessions". I terminated all the unwanted sessions and was good to go.

score 0 · Answer 14 · answered Aug 18 '21 at 15:33

I would recommend using mixed precision training with PyTorch. It can make training way faster and consume less memory.

Take a look at https://spell.ml/blog/mixed-precision-training-with-pytorch-Xuk7YBEAACAASJam.
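
In case the link goes stale, here is a rough sketch of what mixed precision training looks like with PyTorch's built-in `torch.autocast` and `torch.cuda.amp.GradScaler`. The model and data are made up, and mixed precision is switched off when no GPU is present so the snippet still runs on a CPU. As far as I know, recent PyTorch releases prefer `torch.amp.GradScaler("cuda")` and may print a deprecation warning for the older spelling used here.

```python
import torch
from torch import nn

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
# float16 on the GPU; the CPU value is unused because autocast is disabled there.
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

# Hypothetical toy model and data.
model = nn.Linear(32, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # a no-op wrapper when disabled

x = torch.randn(16, 32, device=device)
y = torch.randn(16, 4, device=device)

for _ in range(3):
    optimizer.zero_grad()
    # Inside autocast, eligible ops run in half precision to save memory and time.
    with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_cuda):
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()  # scale the loss to avoid float16 gradient underflow
    scaler.step(optimizer)         # unscales gradients, then calls optimizer.step()
    scaler.update()

print(loss.item())
```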

answered Aug 18 '21 at 15:33

Karol

score 0 · Answer 15 · answered Dec 01 '21 at 20:23

There is now a pretty awesome library which makes this very simple: https://github.com/rentruewang/koila

pip install koila

in your code, simply wrap the input with lazy:

from koila import lazy
input = lazy(input, batch=0)

answered Dec 01 '21 at 20:23

dreamflasher

`pip install koila` still gives me `ModuleNotFoundError: No module named 'koila'`, even after Restart and Run All – StressedBoi69420 Dec 09 '21 at 10:53
sounds like you installed into a different environment. Try `which pip`, `which python`, `which python3`, `which pip3` and have a look how you run your python code, that should give an indication what's going on. – dreamflasher Dec 10 '21 at 10:44
koila doesn't support python 3.7 version – user1682140 May 18 '22 at 13:41
python 3.7 is 4 years old. Time to upgrade. – dreamflasher May 19 '22 at 15:28

score 0 · Answer 16 · answered Mar 14 '22 at 06:59

As long as you don't cross a batch size of 32, you will be fine. Just remember to refresh or restart runtime or else even if you reduce the batch size, you will encounter the same error. I set my batch size to 16, it reduces zero gradients from occurring during my training and the model matches the true function much better. Rather than using a batch size of 4 or 8 which causes the training loss to fluctuate than

score 0 · Answer 17 · edited Mar 29 '22 at 10:49

I hit the same error. My GPU is a GTX 1650 with 4 GB of video memory, and I have 16 GB of RAM. It worked for me when I reduced the batch_size to 3. Hope this helps.

edited Mar 29 '22 at 10:49

answered Mar 29 '22 at 07:16

smith andy

score 0 · Answer 18 · answered Jul 11 '22 at 13:48

I faced the same problem and resolved it by downgrading the PyTorch version from 1.10.1 to 1.8.1 with code 11.3. In my case, I am using an RTX 3060 GPU, which works only with CUDA version 11.3 or above, and when I installed CUDA 11.3, it came with PyTorch 1.10.1. So I downgraded the PyTorch version, and now it is working fine.

$ pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html

You can also try reducing the training batch size.

score 0 · Answer 19 · answered Mar 18 '23 at 22:50

I see that no one has advised waiting after garbage collection. If nothing else helps, you can try waiting until enough GPU memory has been freed. Try this:

import torch
import time
import gc
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

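def clear_gpu_memory():
    # Drop your own references first (e.g. `del model`) in the calling scope;
    # a `del` inside this function cannot remove the caller's variables.
    gc.collect()
    torch.cuda.empty_cache()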

def wait_until_enough_gpu_memory(min_memory_available, max_retries=10, sleep_time=5):
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(torch.cuda.current_device())

    for _ in range(max_retries):
        info = nvmlDeviceGetMemoryInfo(handle)
        if info.free >= min_memory_available:
            break
        print(f"Waiting for {min_memory_available} bytes of free GPU memory. Retrying in {sleep_time} seconds...")
        time.sleep(sleep_time)
    else:
        raise RuntimeError(f"Failed to acquire {min_memory_available} bytes of free GPU memory after {max_retries} retries.")

# Usage example
min_memory_available = 2 * 1024 * 1024 * 1024  # 2GB
clear_gpu_memory()
wait_until_enough_gpu_memory(min_memory_available)

score 0 · Answer 20 · answered Aug 06 '23 at 17:02

Though not relevant to the original question, I faced the same issue while using https://github.com/oobabooga/text-generation-webui Bing search results in this particular SO page as the top result. I resolved this by increasing the GPU memory:

answered Aug 06 '23 at 17:02

banarasi

which software you have used for this? usually I do everything over console. Just faced with this issue. – Gleichmut Aug 08 '23 at 16:38

score -2 · Answer 21 · edited Oct 25 '20 at 23:09

The best way is to lower the batch size. It usually works. Otherwise try this:

import gc

del variable #delete unnecessary variables 
gc.collect()

edited Oct 25 '20 at 23:09

Dharman

answered Oct 20 '20 at 10:58

Harshad Patil
