I'm trying to train a network in Caffe, a slightly modified SegNet-basic
model.
I understand that the "Check failed: error == cudaSuccess (2 vs. 0) out of memory" error I am getting means I am running out of GPU memory. However, what puzzles me is this:
My "old" training attempts worked fine. The network initialized and ran, with the following:
- batch size 4
Memory required for data: 1800929300
(this calculates in the batch size, so it is4x
sample size here)- Total number of parameters: 1418176
- the network is made out of 4x(convolution, ReLU, pooling) followed by 4x(upsample, deconvolution); with 64 filters with kernel size
7x7
per layer.
What surprises me is that my "new" network runs out of memory, and I don't understand what is reserving the additional memory, since I lowered the batch size:
- batch size 1
- Memory required for data: 1175184180 (= the sample size, since the batch size is 1)
- Total number of parameters: 1618944
- The input size is doubled along each dimension (the expected output size does not change); the increased number of parameters comes from one additional set of (convolution, ReLU, pooling) at the beginning of the network.
The number of parameters was counted by this script, by summing the layer-wise parameter counts, each obtained by multiplying together the dimensions of that layer's parameter blob.
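For illustration, here is a minimal Python sketch of such a count (my own, not the linked script), using the parameter blob shapes printed in the layer dumps further down:

```python
# Minimal sketch: count parameters by multiplying the dimensions of each
# parameter blob and summing over all blobs.
def count_params(blob_shapes):
    total = 0
    for shape in blob_shapes:
        n = 1
        for d in shape:
            n *= d
        total += n
    return total

# Parameter blob shapes as printed in the layer dumps below.
old_shapes = (
    [(64, 4, 7, 7)]            # conv1
    + [(64, 64, 7, 7)] * 7     # conv2-conv4, conv_decode4-conv_decode1
    + [(1, 64, 1, 1)] * 8      # the eight *_bn blobs
    + [(3, 64, 1, 1)]          # conv_classifier
)
new_shapes = (
    [(64, 4, 7, 7)]            # conv0
    + [(64, 64, 7, 7)] * 8     # conv1-conv4, conv_decode4-conv_decode1
    + [(1, 64, 1, 1)] * 9      # the nine *_bn blobs
    + [(3, 64, 1, 1)]          # conv_classifier
)
print(count_params(old_shapes))  # 1418176
print(count_params(new_shapes))  # 1618944
```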
Assuming that each parameter needs 4 bytes of memory, data_memory + num_param * 4 still gives a higher memory requirement for my old setup (memory_old = 1806602004, about 1.68 GB) than for the new one (memory_new = 1181659956, about 1.10 GB).
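Written out as a quick check (assuming 4 bytes per value for both data and parameters, as above):

```python
# Quick check of the two totals quoted above, assuming float32 (4 bytes)
# for both the data blobs and the parameters.
data_old, num_param_old = 1800929300, 1418176
data_new, num_param_new = 1175184180, 1618944

memory_old = data_old + num_param_old * 4   # 1806602004 bytes, ~1.68 GB
memory_new = data_new + num_param_new * 4   # 1181659956 bytes, ~1.10 GB
print(memory_old, memory_new)
```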
I've accepted that the additional memory is probably needed somewhere, and that I'll have to rethink my new setup and downsample my input if I can't find a GPU with more memory. However, I am really trying to understand where the additional memory is needed and why my new setup runs out of memory.
EDIT: Per request, here are the layer dimensions for each network, along with the size of the data that passes through it:
"Old" network:
Top shape: 4 4 384 512 (3145728)
('conv1', (64, 4, 7, 7)) --> 4 64 384 512 (50331648)
('conv1_bn', (1, 64, 1, 1)) --> 4 64 384 512 (50331648)
('conv2', (64, 64, 7, 7)) --> 4 64 192 256 (12582912)
('conv2_bn', (1, 64, 1, 1)) --> 4 64 192 256 (12582912)
('conv3', (64, 64, 7, 7)) --> 4 64 96 128 (3145728)
('conv3_bn', (1, 64, 1, 1)) --> 4 64 96 128 (3145728)
('conv4', (64, 64, 7, 7)) --> 4 64 48 64 (786432)
('conv4_bn', (1, 64, 1, 1)) --> 4 64 48 64 (786432)
('conv_decode4', (64, 64, 7, 7)) --> 4 64 48 64 (786432)
('conv_decode4_bn', (1, 64, 1, 1)) --> 4 64 48 64 (786432)
('conv_decode3', (64, 64, 7, 7)) --> 4 64 96 128 (3145728)
('conv_decode3_bn', (1, 64, 1, 1)) --> 4 64 96 128 (3145728)
('conv_decode2', (64, 64, 7, 7)) --> 4 64 192 256 (12582912)
('conv_decode2_bn', (1, 64, 1, 1)) --> 4 64 192 256 (12582912)
('conv_decode1', (64, 64, 7, 7)) --> 4 64 384 512 (50331648)
('conv_decode1_bn', (1, 64, 1, 1)) --> 4 64 384 512 (50331648)
('conv_classifier', (3, 64, 1, 1))
For the "New" network, the top few layers differ and the rest is exactly the same except that the batch size is 1 instead of 4:
Top shape: 1 4 769 1025 (3152900)
('conv0', (64, 4, 7, 7)) --> 1 64 769 1025 (50446400)
('conv0_bn', (1, 64, 1, 1)) --> 1 64 769 1025 (50446400)
('conv1', (64, 64, 7, 7)) --> 1 64 384 512 (12582912)
('conv1_bn', (1, 64, 1, 1)) --> 1 64 384 512 (12582912)
('conv2', (64, 64, 7, 7)) --> 1 64 192 256 (3145728)
('conv2_bn', (1, 64, 1, 1)) --> 1 64 192 256 (3145728)
('conv3', (64, 64, 7, 7)) --> 1 64 96 128 (786432)
('conv3_bn', (1, 64, 1, 1)) --> 1 64 96 128 (786432)
('conv4', (64, 64, 7, 7)) --> 1 64 48 64 (196608)
('conv4_bn', (1, 64, 1, 1)) --> 1 64 48 64 (196608)
('conv_decode4', (64, 64, 7, 7)) --> 1 64 48 64 (196608)
('conv_decode4_bn', (1, 64, 1, 1)) --> 1 64 48 64 (196608)
('conv_decode3', (64, 64, 7, 7)) --> 1 64 96 128 (786432)
('conv_decode3_bn', (1, 64, 1, 1)) --> 1 64 96 128 (786432)
('conv_decode2', (64, 64, 7, 7)) --> 1 64 192 256 (3145728)
('conv_decode2_bn', (1, 64, 1, 1)) --> 1 64 192 256 (3145728)
('conv_decode1', (64, 64, 7, 7)) --> 1 64 384 512 (12582912)
('conv_decode1_bn', (1, 64, 1, 1)) --> 1 64 384 512 (12582912)
('conv_classifier', (3, 64, 1, 1))
These listings skip the pooling and upsampling layers. Here is the train.prototxt for the "new" network. The "old" network does not have the layers conv0, conv0_bn and pool0, while the other layers are the same; the "old" network also has batch_size set to 4 instead of 1.
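As a rough sanity check (my own tally, not a figure Caffe prints), summing the top-blob element counts listed above at 4 bytes per float gives a lower bound on the "Memory required for data" figures, since the pooling, upsampling and loss tops are skipped in these listings:

```python
# Rough tally of the top-blob sizes printed above, at 4 bytes per float32.
# The pooling, upsampling and loss tops are omitted from the listings, so
# these sums come out below Caffe's "Memory required for data" figures.
def blob_bytes(element_counts):
    return sum(element_counts) * 4

old_counts = (
    [3145728]             # input
    + [50331648] * 4      # conv1/_bn, conv_decode1/_bn
    + [12582912] * 4      # conv2/_bn, conv_decode2/_bn
    + [3145728] * 4       # conv3/_bn, conv_decode3/_bn
    + [786432] * 4        # conv4/_bn, conv_decode4/_bn
)
new_counts = (
    [3152900]             # input
    + [50446400] * 2      # conv0/_bn
    + [12582912] * 4      # conv1/_bn, conv_decode1/_bn
    + [3145728] * 4       # conv2/_bn, conv_decode2/_bn
    + [786432] * 4        # conv3/_bn, conv_decode3/_bn
    + [196608] * 4        # conv4/_bn, conv_decode4/_bn
)
print(blob_bytes(old_counts), blob_bytes(new_counts))
```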
EDIT2: Per request, even more info:
- All the input data has the same dimensions: a stack of 4 channels, each of size 769x1025, so the input is always 4x769x1025.
- The Caffe training log is here: as you can see, I get out of memory just after network initialization. Not a single iteration runs.
- My GPU has 8GB of memory, and I've just found out (by trying it on a different machine) that this new network requires 9.5GB of GPU memory.
- Just to reiterate, I am trying to understand how my "old" setup fits into 8GB of memory while the "new" one doesn't, and why the additional memory needed is roughly 8 times larger than the memory needed to hold the input. However, now that I have confirmed that the "new" setup takes only 9.5GB, it might not be that much bigger than the "old" one after all (unfortunately, the GPU is currently being used by somebody else, so I can't check exactly how much memory the old setup needed).