11

I am trying to train a network in Caffe to implement FCN-8s. My images are 512x640 and the batch size is 1.

I am currently running this on an Amazon EC2 instance (g2.2xlarge) with 4 GB of GPU memory. But when I run the solver, it immediately throws the following error:

Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
Aborted (core dumped)

Can someone help me proceed from here?

talonmies
Abhilash Panigrahi
  • related: http://stackoverflow.com/q/36526959/1714410 – Shai Nov 30 '16 at 14:35
  • Two solutions: first, you can try to decrease your batch size, but since your batch size is already 1 that won't help; instead you can resize your pictures, since decreasing the image size can help. Second, you can buy a GPU with more memory. – blackdusts Mar 01 '17 at 13:00

4 Answers

17

The error you get is indeed out of memory, but it's not the host RAM, it's the GPU memory that is exhausted (note that the error comes from CUDA).
Usually, when Caffe runs out of memory, the first thing to do is reduce the batch size (at the cost of gradient accuracy), but since you are already at batch size = 1...
Are you sure the batch size is 1 for both the TRAIN and TEST phases?
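
For reference, a minimal sketch of what the two data layers in a train_val.prototxt might look like, assuming LMDB-backed Data layers (layer names and source paths are placeholders); note that each phase carries its own batch_size:

    layer {
      name: "data"
      type: "Data"
      top: "data"
      top: "label"
      include { phase: TRAIN }
      data_param {
        source: "train_lmdb"   # placeholder path
        backend: LMDB
        batch_size: 1          # TRAIN batch size
      }
    }
    layer {
      name: "data"
      type: "Data"
      top: "data"
      top: "label"
      include { phase: TEST }
      data_param {
        source: "val_lmdb"     # placeholder path
        backend: LMDB
        batch_size: 1          # the TEST phase has its own batch_size and must be reduced as well
      }
    }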

Shai
  • I guessed so. And yes, both train and test phases' batch size is 1. I think I have to resize the training images to something smaller and try it out. But why is 4 GB of GPU memory turning out to be too little? It says `The total number of bytes read was 537399810` which is much smaller than 4 GB. – Abhilash Panigrahi Nov 19 '15 at 08:11
  • @AbhilashPanigrahi is it possible some other processes are using GPU at the same time? try command line `nvidia-smi` to see what's going on on your GPU. – Shai Nov 19 '15 at 08:18
  • I did. No other process is running apart from this (which automatically quits after a few seconds because of the error). – Abhilash Panigrahi Nov 19 '15 at 08:21
  • 1
    I just reduced the image and label size to about 256x320. It runs successfully. I saw it is using around 3.75 GB of GPU memory. Thanks for the help. – Abhilash Panigrahi Nov 19 '15 at 08:47
  • 1
    Is it helpful to add dropout layer if the batch_size is already at 1? @Shai –  Nov 30 '16 at 14:31
  • 1
    @thigi it's unrelated. you can add dropout even when batch_size is one, the dropout does not drop entire samples, but rather prune some of the output neurons. You can have an actual batch size larger than one using `iter_size`. see [this thread](http://stackoverflow.com/q/36526959/1714410). – Shai Nov 30 '16 at 14:34
  • Ok, and what is a usual value for iter_size * batch_size? Like, what should the value of the result be? Is there a rule of thumb? @Shai – Nov 30 '16 at 14:44
  • Resizing the image helped in my case also. Another thing that I did was to move to an Amazon P2 GPU instance, which is costlier but comes with 12 GB of GPU memory, which should be good enough for FCN. – koshy george Dec 17 '16 at 14:44
  • 1
    This was unrelated to my issue, but your answer gave me a hint as to why I was running out of memory. My test batch size was larger than my training batch size so making the test batch size smaller fixed my error. Thank you Shai! – rayryeng Feb 17 '17 at 20:38
  • @rayryeng my pleasure! – Shai Feb 18 '17 at 16:27
  • There are two places to adjust the batch size. For me, adjusting the batch size in the prototxt would get overwritten every time I ran examples/ssd/ssd_pascal.py. There's a line under the gpus initiation, around line 337, that defines two variables (batch_size and accum_batch_size). Setting them both to three fixed my issue. I was bound to a max batch_size of 4. – andor kesselman Jun 23 '17 at 16:42
  • Also, I am running on an NVIDIA Quadro K620, which has 2 GB of memory. – andor kesselman Jun 23 '17 at 16:44
  • @and0rsk with SSD it's a bit different since all the prototxt files are generated by Python; you have to work through the Python code there. – Shai Jun 24 '17 at 19:02
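
Regarding `iter_size` mentioned in the comments above: a minimal sketch of a solver.prototxt that accumulates gradients, with illustrative values only (the net path, learning-rate settings, and snapshot prefix are placeholders). With batch_size: 1 in the net, the effective batch size is batch_size * iter_size, and the accumulation does not increase activation memory because each forward/backward pass still processes a single image.

    net: "train_val.prototxt"           # placeholder
    iter_size: 8                        # accumulate gradients over 8 iterations -> effective batch size 8 * 1 = 8
    base_lr: 1e-10                      # illustrative; FCN-style solvers often use a very small fixed rate
    lr_policy: "fixed"
    momentum: 0.99
    weight_decay: 0.0005
    max_iter: 100000
    snapshot: 4000
    snapshot_prefix: "snapshots/fcn8s"  # placeholder
    solver_mode: GPU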
2

Caffe can use multiple GPUs. This is only supported in the C++ interface, not in the Python one. You could also enable cuDNN for a lower memory footprint.

https://github.com/BVLC/caffe/blob/master/docs/multigpu.md
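
For illustration, a sketch under the assumption of a standard BVLC Caffe build (the layer name and convolution parameters are placeholders; the cuDNN engine requires Caffe compiled with USE_CUDNN := 1):

    # Multi-GPU training is driven from the command-line tool, e.g.:
    #   caffe train --solver=solver.prototxt --gpu=0,1    (or --gpu=all)
    # Individual layers can also request the cuDNN implementation explicitly:
    layer {
      name: "conv1_1"            # placeholder name
      type: "Convolution"
      bottom: "data"
      top: "conv1_1"
      convolution_param {
        num_output: 64
        kernel_size: 3
        pad: 1
        engine: CUDNN            # cuDNN convolution; needs a build with USE_CUDNN := 1
      }
    }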

Simon
0

I was facing a similar issue when running Deeplab v2 on a PC with the following configuration:

  • OS: Ubuntu 18.04.3 LTS (64-bit)
  • Processor: Intel Core i7-6700k CPU @ 4.00 GHz x 8
  • GPU: GeForce GTX 780 (3022 MiB)
  • RAM: 31.3 GiB

Changing both the test and training batch sizes to 1 didn't help me, but changing the dimensions of the output image sure did!
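
For illustration, a minimal sketch of shrinking the crop fed to the network through the data layer's transform_param (the layer type, paths, crop value, and mean values are assumptions; Deeplab prototxts typically use their own data layer, but the idea is the same). Smaller spatial dimensions mean smaller feature maps at every layer, which is where most of the memory goes.

    layer {
      name: "data"
      type: "ImageData"           # placeholder; actual Deeplab setups use their own data layer
      top: "data"
      top: "label"
      include { phase: TRAIN }
      transform_param {
        mirror: true
        crop_size: 321            # smaller crops -> smaller feature maps -> less GPU memory
        mean_value: 104.008       # illustrative mean values
        mean_value: 116.669
        mean_value: 122.675
      }
      image_data_param {
        source: "train_list.txt"  # placeholder list file
        batch_size: 1
      }
    }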

Dharman
0

I faced the same issue. It got resolved after I force-killed the process linked with training (`kill -9 <pid>`). For some reason, the previous train.py process was still running and holding GPU memory.

harshith__