
I'm trying to train a network in Caffe, a slightly modified SegNet-basic model.

I understand that the `Check failed: error == cudaSuccess (2 vs. 0) out of memory` error I am getting is due to me running out of GPU memory. However, what puzzles me is this:

My "old" training attempts worked fine. The network initialized and ran, with the following:

  • batch size 4
  • Memory required for data: 1800929300 (this figure already accounts for the batch size, so here it is 4x the per-sample size)
  • Total number of parameters: 1418176
  • the network is made up of 4x (convolution, ReLU, pooling) followed by 4x (upsample, deconvolution), with 64 filters of kernel size 7x7 per layer.

What surprises me is that my "new" network runs out of memory, and I don't understand what is reserving the additional memory, since I lowered the batch size:

  • batch size 1
  • Memory required for data: 1175184180 (= per-sample size, since the batch size is 1)
  • Total number of parameters: 1618944
  • The input size is doubled along each dimension (the expected output size does not change), so the increased number of parameters comes from one additional set of (convolution, ReLU, pooling) at the beginning of the network.

The number of parameters was counted by this script, by summing up the layer-wise parameter counts, each obtained by multiplying out the dimensions of the layer's parameter blob.
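For illustration, a minimal pycaffe sketch of the same idea (this is not the linked script; `train.prototxt` is a placeholder path, and it sums every blob in `net.params`, so it would also include bias blobs where a layer has them):

```python
import numpy as np
import caffe

# Load the net only to inspect its parameter blobs
# ("train.prototxt" is a placeholder for the actual model definition).
net = caffe.Net('train.prototxt', caffe.TEST)

# Multiply out the dimensions of every parameter blob and sum over all layers.
num_params = sum(int(np.prod(blob.data.shape))
                 for blobs in net.params.values()
                 for blob in blobs)
print('Total number of parameters:', num_params)
```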

Assuming that each parameter needs 4 bytes of memory, data_memory + num_params * 4 still gives a higher memory requirement for my old setup (memory_old = 1800929300 + 1418176 * 4 = 1806602004 bytes ≈ 1.68GB) than for the new one (memory_new = 1175184180 + 1618944 * 4 = 1181659956 bytes ≈ 1.10GB).
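The same arithmetic as a quick back-of-the-envelope check (only the figures quoted above, 4 bytes per float):

```python
# Back-of-the-envelope check of the figures quoted above (bytes, float32).
data_old, params_old = 1800929300, 1418176   # "old" setup
data_new, params_new = 1175184180, 1618944   # "new" setup

memory_old = data_old + params_old * 4       # 1806602004 bytes ~ 1.68 GB
memory_new = data_new + params_new * 4       # 1181659956 bytes ~ 1.10 GB
print(memory_old / 2**30, memory_new / 2**30)
```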

I've accepted that the additional memory is probably needed somewhere, and that I'll have to re-think my new setup and downsample my input if I can't find a GPU with more memory. However, I am really trying to understand where the additional memory is needed and why my new setup runs out of memory.


EDIT: Per request, here are the layer dimensions for each of the networks, together with the size of the data that passes through them:

"Old" network:

                            Top shape: 4 4 384 512 (3145728)
('conv1', (64, 4, 7, 7))           --> 4 64 384 512 (50331648)
('conv1_bn', (1, 64, 1, 1))        --> 4 64 384 512 (50331648)
('conv2', (64, 64, 7, 7))          --> 4 64 192 256 (12582912)
('conv2_bn', (1, 64, 1, 1))        --> 4 64 192 256 (12582912)
('conv3', (64, 64, 7, 7))          --> 4 64 96 128 (3145728)
('conv3_bn', (1, 64, 1, 1))        --> 4 64 96 128 (3145728)
('conv4', (64, 64, 7, 7))          --> 4 64 48 64 (786432)
('conv4_bn', (1, 64, 1, 1))        --> 4 64 48 64 (786432)
('conv_decode4', (64, 64, 7, 7))   --> 4 64 48 64 (786432)
('conv_decode4_bn', (1, 64, 1, 1)) --> 4 64 48 64 (786432)
('conv_decode3', (64, 64, 7, 7))   --> 4 64 96 128 (3145728)
('conv_decode3_bn', (1, 64, 1, 1)) --> 4 64 96 128 (3145728)
('conv_decode2', (64, 64, 7, 7))   --> 4 64 192 256 (12582912)
('conv_decode2_bn', (1, 64, 1, 1)) --> 4 64 192 256 (12582912)
('conv_decode1', (64, 64, 7, 7))   --> 4 64 384 512 (50331648)
('conv_decode1_bn', (1, 64, 1, 1)) --> 4 64 384 512 (50331648)
('conv_classifier', (3, 64, 1, 1))

For the "New" network, the top few layers differ and the rest is exactly the same except that the batch size is 1 instead of 4:

                            Top shape: 1 4 769 1025 (3152900)
('conv0', (64, 4, 7, 7))           --> 1 64 769 1025 (50446400)
('conv0_bn', (1, 64, 1, 1))        --> 1 64 769 1025 (50446400)
('conv1', (64, 64, 7, 7))          --> 1 64 384 512 (12582912)
('conv1_bn', (1, 64, 1, 1))        --> 1 64 384 512 (12582912)
('conv2', (64, 64, 7, 7))          --> 1 64 192 256 (3145728)
('conv2_bn', (1, 64, 1, 1))        --> 1 64 192 256 (3145728)
('conv3', (64, 64, 7, 7))          --> 1 64 96 128 (786432)
('conv3_bn', (1, 64, 1, 1))        --> 1 64 96 128 (786432)
('conv4', (64, 64, 7, 7))          --> 1 64 48 64 (196608)
('conv4_bn', (1, 64, 1, 1))        --> 1 64 48 64 (196608)
('conv_decode4', (64, 64, 7, 7))   --> 1 64 48 64 (196608)
('conv_decode4_bn', (1, 64, 1, 1)) --> 1 64 48 64 (196608)
('conv_decode3', (64, 64, 7, 7))   --> 1 64 96 128 (786432)
('conv_decode3_bn', (1, 64, 1, 1)) --> 1 64 96 128 (786432)
('conv_decode2', (64, 64, 7, 7))   --> 1 64 192 256 (3145728)
('conv_decode2_bn', (1, 64, 1, 1)) --> 1 64 192 256 (3145728)
('conv_decode1', (64, 64, 7, 7))   --> 1 64 384 512 (12582912)
('conv_decode1_bn', (1, 64, 1, 1)) --> 1 64 384 512 (12582912)
('conv_classifier', (3, 64, 1, 1))

These listings skip the pooling and upsampling layers. Here is the train.prototxt for the "new" network. The old network does not have the layers conv0, conv0_bn and pool0, while the other layers are the same. The "old" network also has batch_size set to 4 instead of 1.
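As a side note on where the "Memory required for data" number comes from: Caffe logs it as a running sum of the element counts of all top blobs, times the size of a float. A rough sketch of that accounting over the "old" shapes listed above (it falls short of the 1800929300 reported in the log, since the listing omits the pooling/upsampling and other tops):

```python
# Sum the element counts of the top blobs listed above for the "old" network
# and convert to bytes (4 bytes per float32). This undercounts the logged
# 1800929300 because the pooling/upsampling (and loss) tops are not listed.
top_counts_old = [
    3145728,                # input data
    50331648, 50331648,     # conv1, conv1_bn
    12582912, 12582912,     # conv2, conv2_bn
    3145728, 3145728,       # conv3, conv3_bn
    786432, 786432,         # conv4, conv4_bn
    786432, 786432,         # conv_decode4, conv_decode4_bn
    3145728, 3145728,       # conv_decode3, conv_decode3_bn
    12582912, 12582912,     # conv_decode2, conv_decode2_bn
    50331648, 50331648,     # conv_decode1, conv_decode1_bn
]
partial_data_memory = sum(top_counts_old) * 4
print(partial_data_memory)  # ~1.0 GB, a lower bound on the logged figure
```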


EDIT2: Per request, even more info:

  • All the input data has the same dimensions. It's a stack of 4 channels, each of size 769x1025, so the input is always 4x769x1025.
  • The Caffe training log is here; as you can see, I run out of memory right after network initialization. Not a single iteration runs.
  • My GPU has 8GB of memory, and I've just found out (by trying it on a different machine) that this new network requires 9.5GB of GPU memory.
  • Just to reiterate: I am trying to understand how come my "old" setup fits into 8GB of memory and the "new" one doesn't, as well as why the additional memory needed is ~8 times the memory needed to hold the input. However, now that I have confirmed that the "new" setup takes only 9.5GB, it might not be as much bigger than the "old" one as I suspected (unfortunately, the GPU is currently being used by somebody else, so I can't check exactly how much memory the old setup needed).
  • The size of the input and the feature maps also determines memory use; it's not just about the number of parameters. You should share the full architecture of both models. – Dr. Snoopy Mar 01 '19 at 15:16
  • @MatiasValdenegro I've updated the question with the sizes of the input and the feature maps. Should I also share the prototxt describing the architecture through pastebin or similar? – penelope Mar 01 '19 at 16:12
  • Could you also attach your Caffe log? – Dmytro Prylipko Mar 04 '19 at 17:03
  • 1. Are all training images of the same size/resolution? 2. Are you getting "out of memory" at the beginning of training or after a while? When validating? 3. What solver are you using? SGD? ADAM? 4. How much memory does your GPU have? – Shai Mar 05 '19 at 06:41

1 Answer


Bear in mind that Caffe actually allocates room for two copies of the net: the "train phase" net and the "test phase" net. So if the data takes 1.1GB, you need to double this space.
Moreover, you need to allocate space for the parameters. Each parameter also needs to store its gradient. In addition, the solver keeps track of the "momentum" for each parameter (and sometimes even of a 2nd moment, e.g., in the ADAM solver). Therefore, increasing the number of parameters even by a tiny amount can result in a significant addition to the memory footprint of the training system.
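As a rough illustration of this bookkeeping (just a sketch with the figures from the question; it ignores cuDNN workspaces and other overhead, so it is a lower bound rather than what Caffe actually allocates):

```python
BYTES = 4                      # float32

data_new   = 1175184180        # "Memory required for data" from the question
params_new = 1618944           # parameter count from the question

# Caffe builds a train-phase net and a test-phase net,
# so the activation memory roughly doubles.
activations = 2 * data_new

# Each parameter stores its value plus a gradient (diff), plus solver history:
# e.g. one extra copy for SGD momentum, two for ADAM (1st and 2nd moments).
solver_history = 2             # ADAM
param_mem = params_new * BYTES * (1 + 1 + solver_history)

print((activations + param_mem) / 2**30, 'GB')
```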

  • I guess the extra space needed per parameter is the most probable explanation. But can you clarify the point about Caffe storing both the train and test phase of the net? I find this quite confusing, as for training I only pass the `train.prototxt` (or rather, a `solver.prototxt` pointing only to the `train` net) to Caffe; no files pointing to any test data are used or read in any way until the training is completely done and I start running my testing and evaluation procedures. – penelope Mar 12 '19 at 12:15
  • @penelope look at the log file: you'll see caffe building the net twice, once for training and once for evaluation, aka `phase: TEST`. – Shai Mar 12 '19 at 14:19