
I use two methods to train on my own data: the first trains a model from scratch, and the second uses fine-tuning (following https://github.com/tensorflow/models/tree/master/slim). All parameters are identical for the two methods except the checkpoint settings, yet the first method always runs out of GPU memory. If I decrease the batch size for the first method, it works fine. What is the reason?

More information can be found at https://github.com/tensorflow/models/issues/848.

1. Training from scratch
python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=my_data \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_resnet_v2 \
--max_number_of_steps=500000 \
--batch_size=48 \
--num_readers=16 \
--learning_rate=0.1 \
--learning_rate_decay_type=exponential \
--num_epochs_per_decay=4.0 \
--learning_rate_decay_factor=0.9 \
--save_interval_secs=6000 \
--save_summaries_secs=1000 \
--log_every_n_steps=100 \
--optimizer=adam \
--opt_epsilon=1e-1 \
--weight_decay=0.0004 \
--num_clones=6
The out-of-memory log is as follows:
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512): Total Chunks: 1, Chunks in use: 0 768B allocated for chunks. 768B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16384): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (32768): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (65536): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (262144): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8388608): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (33554432): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (134217728): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (268435456): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 57.57MiB was 32.00MiB, Chunk State:
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0000 of size 1280
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0500 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0600 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0700 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0800 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0900 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0a00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0b00 of size 256
................................
...............................
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 2 Chunks of size 265531392 totalling 506.46MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 10.56GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 11343019213
InUse: 11336882944
MaxInUse: 11340513536
NumAllocs: 11965
MaxAllocSize: 2632187904

W tensorflow/core/common_runtime/bfc_allocator.cc:274] ****************************************************************************************************
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 24.38MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[48,8,8,2080]
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ****************************************************************************************************
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 24.38MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[48,8,8,2080]
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor with shape[48,17,17,1088]
[[Node: clone_0/InceptionResnetV2/Repeat_1/block17_17/Conv2d_1x1/BiasAdd = BiasAdd[T=DT_FLOAT, data_format="NHWC", _device="/job:localhost/replica:0/task:0/gpu:0"](clone_0/InceptionResnetV2/Repeat_1/block17_17/Conv2d_1x1/convolution, InceptionResnetV2/Repeat_1/block17_17/Conv2d_1x1/biases/read)]]
[[Node: train_op/_24881 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_188818_train_op", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
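
As a quick sanity check on this log (an illustrative calculation, not part of the original report): the 24.38MiB figure matches the float32 tensor shape named in the OOM message, and the leading dimension of that shape is the batch size, so this activation grows linearly with --batch_size.

# Size of the float32 tensor reported in the OOM message: shape [48, 8, 8, 2080].
size_bytes = 48 * 8 * 8 * 2080 * 4      # 4 bytes per float32 element
print(size_bytes / 2.0**20)             # ~24.38 (MiB), matching the failed allocation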



2. Training with fine-tuning
    The command is:
    python train_image_classifier.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_name=my_data \
    --dataset_split_name=train \
    --dataset_dir=${DATASET_DIR} \
    --model_name=inception_resnet_v2 \
    --checkpoint_path=${PRETRAINED_CHECKPOINT_DIR}/inception_resnet_v2_2016_08_30.ckpt \
    --checkpoint_exclude_scopes=InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits \
    --trainable_scopes=InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits \
    --max_number_of_steps=500000 \
    --batch_size=48 \
    --num_readers=16 \
    --learning_rate=0.1 \
    --learning_rate_decay_type=exponential \
    --num_epochs_per_decay=4.0 \
    --learning_rate_decay_factor=0.9 \
    --save_interval_secs=6000 \
    --save_summaries_secs=1000 \
    --log_every_n_steps=100 \
    --optimizer=adam \
    --opt_epsilon=1e-1 \
    --weight_decay=0.0004 \
    --num_clones=6
    The running log is as follows:
    W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xcad55b0
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 4 with properties:
    name: Tesla K40m
    major: 3 minor: 5 memoryClockRate (GHz) 0.745
    pciBusID 0000:30:00.0
    Total memory: 11.25GiB
    Free memory: 11.12GiB
    W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x3a765740
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 5 with properties:
    name: Tesla K40m
    major: 3 minor: 5 memoryClockRate (GHz) 0.745
    pciBusID 0000:33:00.0
    Total memory: 11.25GiB
    Free memory: 11.12GiB
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 4
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 5
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 4
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 5
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 4
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 5
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 4
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 5
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 0
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 1
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 2
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 3
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 0
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 1
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 2
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 3
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3 4 5
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y Y Y N N
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y Y Y N N
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: Y Y Y Y N N
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: Y Y Y Y N N
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 4: N N N N Y Y
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 5: N N N N Y Y
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:09:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40m, pci bus id: 0000:0a:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K40m, pci bus id: 0000:0d:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K40m, pci bus id: 0000:0e:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:4) -> (device: 4, name: Tesla K40m, pci bus id: 0000:30:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:5) -> (device: 5, name: Tesla K40m, pci bus id: 0000:33:00.0)
    INFO:tensorflow:Starting Session.
    INFO:tensorflow:Starting Queues.
    INFO:tensorflow:global_step/sec: 0
    INFO:tensorflow:global step 100: loss = 19.9046 (2.07 sec/step)
    INFO:tensorflow:global step 200: loss = 19.8159 (2.64 sec/step)
    INFO:tensorflow:global step 300: loss = 19.7198 (2.82 sec/step)

    Training runs successfully.
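
    For reference, here is a rough sketch (not the exact TF-Slim source) of how train_image_classifier.py interprets the --checkpoint_exclude_scopes and --trainable_scopes flags used above, assuming the standard tf.contrib.slim helpers. Because only the Logits/AuxLogits scopes are trainable here, gradients and Adam optimizer slots are created for a small subset of variables, which plausibly accounts for the lower memory use compared with training every variable from scratch.

    import tensorflow as tf
    slim = tf.contrib.slim

    # Illustrative literals mirroring the flag values above (the real script parses FLAGS).
    checkpoint_exclude_scopes = ["InceptionResnetV2/Logits", "InceptionResnetV2/AuxLogits"]
    trainable_scopes = ["InceptionResnetV2/Logits", "InceptionResnetV2/AuxLogits"]

    # Restore every model variable from the checkpoint except those in the excluded scopes.
    variables_to_restore = [
        v for v in slim.get_model_variables()
        if not any(v.op.name.startswith(s) for s in checkpoint_exclude_scopes)
    ]

    # Only variables inside the trainable scopes are handed to the optimizer.
    variables_to_train = []
    for scope in trainable_scopes:
        variables_to_train += tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)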
3. Decreasing the batch size of the first method
If the batch_size is decreased to 28, training runs successfully. However, another problem appears: the processing time per step becomes longer (3.48 sec/step), whereas the fine-tuning method takes 2.07 sec/step.
The command is:
python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=my_data \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_resnet_v2 \
--max_number_of_steps=500000 \
--batch_size=28 \
--num_readers=16 \
--learning_rate=0.1 \
--learning_rate_decay_type=exponential \
--num_epochs_per_decay=4.0 \
--learning_rate_decay_factor=0.9 \
--save_interval_secs=6000 \
--save_summaries_secs=1000 \
--log_every_n_steps=100 \
--optimizer=adam \
--opt_epsilon=1e-1 \
--weight_decay=0.0004 \
--num_clones=6

The running log is:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3 4 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 4: N N N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 5: N N N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40m, pci bus id: 0000:0a:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K40m, pci bus id: 0000:0d:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K40m, pci bus id: 0000:0e:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:4) -> (device: 4, name: Tesla K40m, pci bus id: 0000:30:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:5) -> (device: 5, name: Tesla K40m, pci bus id: 0000:33:00.0)
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 1902 get requests, put_count=1100 evicted_count=1000 eviction_rate=0.909091 and unsatisfied allocation rate=1
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Starting Session.
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.29GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.09GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.57GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
INFO:tensorflow:global step 100: loss = 26.2026 (3.49 sec/step)
INFO:tensorflow:global step 200: loss = 25.8227 (3.47 sec/step)
INFO:tensorflow:global_step/sec: 0.266924
INFO:tensorflow:global step 300: loss = 25.2874 (3.48 sec/step)
INFO:tensorflow:global step 400: loss = 24.7210 (3.47 sec/step)
INFO:tensorflow:global step 500: loss = 24.2435 (3.47 sec/step)
marvision
  • Welcome to SO! Please move the required information from the GitHub issue into the Stack Overflow question itself. – bman Jan 09 '17 at 02:18
  • You could follow some techniques [here](http://stackoverflow.com/questions/41513973/buffer-underrun-and-resourceexhausted-errors-with-tensorflow?noredirect=1#comment70280148_41513973) to calculate exact memory allocated/deallocated during each run step and see if there's a difference – Yaroslav Bulatov Jan 09 '17 at 17:38
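
Regarding the per-step memory accounting suggested in the last comment, a minimal TF 1.x-style sketch (not taken from the linked thread) is to run a single step with full tracing enabled and dump a Chrome trace that includes allocator information; the tiny graph below is only a stand-in so the snippet runs on its own.

import tensorflow as tf
from tensorflow.python.client import timeline

# Stand-in graph; replace with the real training graph and train_op.
x = tf.Variable(tf.random_normal([1000, 1000]))
loss = tf.reduce_sum(tf.square(x))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, options=run_options, run_metadata=run_metadata)
    # step_stats carries per-node timing and allocation records for this step.
    trace = timeline.Timeline(run_metadata.step_stats)
    with open("step_trace.json", "w") as f:
        f.write(trace.generate_chrome_trace_format(show_memory=True))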

1 Answer


The batch size indicates how many examples you are uploading to the GPU for training at the same time. Reducing the batch size reduces the memory footprint of each training step. To get the maximum performance out of your training, you will probably want to find the "sweet spot" (the largest batch size that does not run into OOM) for your training set and hardware.

nilsmagnus
  • Thanks for nilsmagnus's reply. During training, I think the two methods should use the same amount of GPU memory, yet the first method cannot train successfully unless the batch size is decreased. Another question is why the smaller batch size runs more slowly; I would have expected that the bigger the batch size, the faster the training. – marvision Jan 09 '17 at 09:18
  • The smaller the batch size, the more batches of data have to be transferred from RAM to GPU RAM. The time spent on this IO is the cause of the slowdown. – nilsmagnus Jan 09 '17 at 09:36
  • If it is an IO problem, that seems quite serious. Is there any way to solve it? Right now I train on a single machine with 6 GPU cards; if training were distributed over more machines and more GPUs, would that mean the framework cannot work? – marvision Jan 12 '17 at 02:02
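
Following up on the "sweet spot" suggestion in the answer, a minimal sketch (TF 1.x-style; build_dummy_train_op is only a placeholder for the real graph-building code) that probes batch sizes in decreasing order and catches the out-of-memory error:

import tensorflow as tf

def build_dummy_train_op(batch_size):
    # Placeholder model; swap in your own graph construction here.
    images = tf.random_normal([batch_size, 299, 299, 3])
    weights = tf.Variable(tf.random_normal([299 * 299 * 3, 10]))
    logits = tf.matmul(tf.reshape(images, [batch_size, -1]), weights)
    loss = tf.reduce_mean(tf.square(logits))
    return tf.train.AdamOptimizer(0.1).minimize(loss)

def find_max_batch_size(build_train_op, candidates=(64, 48, 32, 28, 16)):
    for batch_size in candidates:
        tf.reset_default_graph()
        train_op = build_train_op(batch_size)
        try:
            with tf.Session() as sess:
                sess.run(tf.global_variables_initializer())
                sess.run(train_op)   # one step is enough to trigger an OOM
            return batch_size        # first candidate that fits wins
        except tf.errors.ResourceExhaustedError:
            continue                 # too large for this GPU, try the next one
    return None

print(find_max_batch_size(build_dummy_train_op))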