10

I'm using a GPU on Google Colab to run some deep learning code.

I have got 70% of the way through the training, but now I keep getting the following error:

RuntimeError: CUDA out of memory. Tried to allocate 2.56 GiB (GPU 0; 15.90 GiB total capacity; 10.38 GiB already allocated; 1.83 GiB free; 2.99 GiB cached)

I'm trying to understand what this means. Is it talking about RAM? If so, the code should just run the same as it has been doing, shouldn't it? When I try to restart it, the memory message appears immediately. Why would it be using more RAM when I start it today than it did when I started it yesterday or the day before?

Or is this message about hard disk space? I could understand that, because the code saves things as it goes along, so the hard disk usage would be cumulative.

Any help would be much appreciated.


So if it's just the GPU running out of memory, could someone explain why the error message says 10.38 GiB already allocated? How can there be memory already allocated when I start to run something? Could that memory be being used by someone else? Do I just need to wait and try again later?

Here is a screenshot of the GPU usage when I run the code, just before it runs out of memory:

[screenshot of GPU usage]


I found this post in which people seem to be having similar problems. When I run the code suggested in that thread, I see:

Gen RAM Free: 12.6 GB  | Proc size: 188.8 MB
GPU RAM Free: 16280MB | Used: 0MB | Util   0% | Total 16280MB

which seems to suggest there is 16 GB of RAM free.

I'm confused.

user1551817

8 Answers

10

Try reducing your batch size to 8 or 16. It worked for me.
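
For example, if the training code uses a PyTorch DataLoader (an assumption, since the answer doesn't name a framework), the batch size is set when the loader is built. A minimal sketch:

from torch.utils.data import DataLoader, TensorDataset
import torch

# Toy dataset purely for illustration; swap in your own Dataset.
dataset = TensorDataset(torch.randn(1000, 3, 64, 64), torch.randint(0, 10, (1000,)))

# Lowering batch_size (e.g. from 64 to 8 or 16) shrinks the activations held on
# the GPU per training step, which is usually what triggers the CUDA OOM error.
train_loader = DataLoader(dataset, batch_size=8, shuffle=True)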

ramnarayan
9

You are running out of memory on the GPU. If you are running Python code, try running the code below before yours. It will show how much memory you have. Note that if you try to load more data than the total memory can hold, it will fail.

# Memory-footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize

import os

import humanize
import psutil
import GPUtil as GPU

GPUs = GPU.getGPUs()
# XXX: Colab usually exposes a single GPU, but even that isn't guaranteed.
gpu = GPUs[0]

def printm():
    """Print free host RAM, this process's size, and the GPU memory stats."""
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
          " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(
        gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil * 100, gpu.memoryTotal))

printm()
Etore Marcari Jr.
7

Google Colab resource allocation is dynamic and based on users' past usage. If one user has been consuming a lot of resources recently and another user uses Colab less frequently, the less frequent user will be given relatively more preference in resource allocation.

Hence, to get the most out of Colab, close all your Colab tabs and all other active sessions, then restart the runtime for the one you want to use. You'll definitely get a better GPU allocation.

If you are training a neural network and still face the same issue, try reducing the batch size too.

    Closing all other active *Colab Tabs* worked for me when combined with reducing the batch size from 16 to 8. Kudos, mate! – odunayo12 Sep 21 '21 at 09:51
  • This seems odd to me. As a free user I made the most of the time they gave me and so, when I finally hit the usage limit, I opted to pay for Colab Pro (while also getting more memory, so they say). Yet now *anything* I try to run today fails with Out Of Memory errors—that I wasn't getting as a free user. Huh? – cbmtrx Jun 12 '22 at 20:31
3

Just as an answer for other people using Google Colab: I had this problem often when I used it for my deep learning class. I started paying for Google Colab, and it immediately started allowing me to run my code. That does not make the problem go away completely, though. I started using Google Colab for my research and hit this error again! I did some digging on Google Colab's website and found that there are GPU usage limits even for people who pay for Colab. To test this, I tried using a secondary Gmail account I rarely use. Sure enough, it ran perfectly...

So, in short: share your code with a secondary email, or set up a new email account and sign into Colab with it. If that works for any of you, comment below so people are aware of this. I found it super frustrating and lost a lot of time to this error.

Cami
  • Did you get a Google Colab subscription for your secondary account? – kabhel Sep 07 '21 at 09:42
  • @kabhel My Q too. That would simply mean they're back to running Colab in free mode...which defeats the point of PAYING for Pro privileges. Right? – cbmtrx Jun 12 '22 at 20:33
0

I was attempting to use the trained model to predict the test dataset (~17,000 entries) when the CUDA out of memory error appeared.

Reducing the batch size from 32 to 4 didn't work for me; I could see that the memory required to run the operation was not decreasing with the change in batch size.

What worked for me was splitting the test dataset into smaller chunks, running prediction on each chunk, and then merging the predicted outputs back into a combined dataframe afterwards.
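
Not the exact code from this answer, but a minimal sketch of the chunked-prediction idea, assuming a PyTorch model and a pandas dataframe of inputs (the toy model, test_df and CHUNK_SIZE below are stand-ins, not the answerer's names):

import numpy as np
import pandas as pd
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-ins purely for illustration: a toy model and a toy test dataframe.
model = nn.Linear(4, 2).to(device)
test_df = pd.DataFrame(np.random.randn(17000, 4).astype("float32"))

CHUNK_SIZE = 1000
outputs = []

model.eval()
with torch.no_grad():                        # no autograd graph -> much less GPU memory
    for start in range(0, len(test_df), CHUNK_SIZE):
        chunk = test_df.iloc[start:start + CHUNK_SIZE]
        batch = torch.tensor(chunk.values, device=device)
        preds = model(batch)
        outputs.append(preds.cpu().numpy())  # move each chunk's results off the GPU right away

# Merge the per-chunk predictions back into one combined dataframe.
predictions = pd.DataFrame(np.concatenate(outputs), index=test_df.index)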

Goh Jia Yi
0

There are a few techniques to tackle this problem:

  1. Reduce the batch size; for example, if you have 1000, reduce it to 700 or 500, then restart the runtime (see the sketch after this list).
  2. Go to Runtime -> Factory reset runtime.
  3. Reduce num_workers.
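
If the input pipeline is a PyTorch DataLoader (again an assumption; the answer doesn't name a framework), points 1 and 3 are both constructor arguments. A small sketch:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset purely for illustration.
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# batch_size (point 1) is what usually drives GPU memory use; num_workers (point 3)
# mainly affects host (CPU) RAM and data-loading throughput.
loader = DataLoader(dataset, batch_size=500, num_workers=2, shuffle=True)
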
0

I got this after running a few training sessions in my notebook, so I assumed something was staying in memory too long.

import gc
gc.collect()  # force a garbage-collection pass so unreferenced objects (and the GPU memory they hold) are released

That solved it, although for some reason I sometimes had to wait a few seconds after running GC.
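
This answer only invokes Python's garbage collector. If you are on PyTorch (an assumption, although the error message in the question is in PyTorch's format), a common companion step is to also release the framework's cached CUDA blocks:

import gc
import torch

gc.collect()                  # drop unreferenced Python objects first
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # hand PyTorch's unused cached GPU memory back to the driver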

Aur Saraf
0

My woes were caused by keeping my loss on the GPU and appending it to a list. (That probably caused torch to keep the whole computation graph intact, and it took only a few batches to consume all the available GPU RAM.) For example, when you save the model's loss, make sure to do:

epoch_losses.append(loss.item())

rather than

epoch_losses.append(loss)
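
For context, here is a minimal runnable loop using this pattern; the tiny model and data are stand-ins, not the answerer's setup:

import torch
from torch import nn, optim

# Tiny stand-in model and data, purely for illustration.
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

epoch_losses = []
for epoch in range(3):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    # .item() returns a plain Python float, detached from the graph, so the
    # computation graph for this step can be freed instead of piling up on the GPU.
    epoch_losses.append(loss.item())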

GregarityNow