I use two methods to train on my own data: the first is training a model from scratch, and the second is fine-tuning (following https://github.com/tensorflow/models/tree/master/slim). All parameters are identical for the two methods except the checkpoint settings, yet the first method always runs out of GPU memory, while it works once I decrease its batch size. What is the reason?
More information can be found at https://github.com/tensorflow/models/issues/848
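To compare the GPU memory footprint of the two runs, the per-GPU usage can be watched while each job is running. This is just a monitoring sketch using plain nvidia-smi (not part of the slim scripts), printing used/total memory per device every 5 seconds:

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5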
1. Training from scratch
python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=my_data \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_resnet_v2 \
--max_number_of_steps=500000 \
--batch_size=48 \
--num_readers=16 \
--learning_rate=0.1 \
--learning_rate_decay_type=exponential \
--num_epochs_per_decay=4.0 \
--learning_rate_decay_factor=0.9 \
--save_interval_secs=6000 \
--save_summaries_secs=1000 \
--log_every_n_steps=100 \
--optimizer=adam \
--opt_epsilon=1e-1 \
--weight_decay=0.0004 \
--num_clones=6
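For reference, if batch_size is applied per clone (the shape[48, ...] tensors in the OOM log below suggest each GPU sees the full 48), the effective global batch per step would be:

echo $((48 * 6))   # 288 images per step across 6 clones

This is only my assumption about how the slim trainer distributes work; for memory, the per-clone batch of 48 is what matters.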
The out-of-memory log looks like this:
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512): Total Chunks: 1, Chunks in use: 0 768B allocated for chunks. 768B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16384): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (32768): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (65536): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (262144): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8388608): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (33554432): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (134217728): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (268435456): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 57.57MiB was 32.00MiB, Chunk State:
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0000 of size 1280
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0500 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0600 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0700 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0800 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0900 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0a00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x231adc0b00 of size 256
................................
...............................
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 2 Chunks of size 265531392 totalling 506.46MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 10.56GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 11343019213
InUse: 11336882944
MaxInUse: 11340513536
NumAllocs: 11965
MaxAllocSize: 2632187904
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ****************************************************************************************************
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 24.38MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[48,8,8,2080]
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ****************************************************************************************************
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 24.38MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[48,8,8,2080]
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor with shape[48,17,17,1088]
[[Node: clone_0/InceptionResnetV2/Repeat_1/block17_17/Conv2d_1x1/BiasAdd = BiasAdd[T=DT_FLOAT, data_format="NHWC", _device="/job:localhost/replica:0/task:0/gpu:0"](clone_0/InceptionResnetV2/Repeat_1/block17_17/Conv2d_1x1/convolution, InceptionResnetV2/Repeat_1/block17_17/Conv2d_1x1/biases/read)]]
[[Node: train_op/_24881 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_188818_train_op", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
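To isolate whether a single clone fits in memory at batch_size=48, one thing I could try (not part of the runs above) is restricting the job to one GPU and one clone, keeping the remaining flags the same as in the command above; CUDA_VISIBLE_DEVICES simply hides the other GPUs from TensorFlow:

CUDA_VISIBLE_DEVICES=0 python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=my_data \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_resnet_v2 \
--batch_size=48 \
--num_clones=1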
2. Training with fine-tuning
The command is:
python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=my_data \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_resnet_v2 \
--checkpoint_path=${PRETRAINED_CHECKPOINT_DIR}/inception_resnet_v2_2016_08_30.ckpt \
--checkpoint_exclude_scopes=InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits \
--trainable_scopes=InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits \
--max_number_of_steps=500000 \
--batch_size=48 \
--num_readers=16 \
--learning_rate=0.1 \
--learning_rate_decay_type=exponential \
--num_epochs_per_decay=4.0 \
--learning_rate_decay_factor=0.9 \
--save_interval_secs=6000 \
--save_summaries_secs=1000 \
--log_every_n_steps=100 \
--optimizer=adam \
--opt_epsilon=1e-1 \
--weight_decay=0.0004 \
--num_clones=6
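To double-check which variables the pretrained checkpoint provides (and confirm the InceptionResnetV2/Logits and InceptionResnetV2/AuxLogits scopes that I exclude and train), the checkpoint contents can be listed with TensorFlow's inspect_checkpoint tool. I'm assuming the tool is available as a module in this TensorFlow build; with no tensor name given it just prints tensor names and shapes:

python -m tensorflow.python.tools.inspect_checkpoint \
--file_name=${PRETRAINED_CHECKPOINT_DIR}/inception_resnet_v2_2016_08_30.ckpt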
The running log is as follows:
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xcad55b0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 4 with properties:
name: Tesla K40m
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:30:00.0
Total memory: 11.25GiB
Free memory: 11.12GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x3a765740
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 5 with properties:
name: Tesla K40m
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:33:00.0
Total memory: 11.25GiB
Free memory: 11.12GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3 4 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 4: N N N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 5: N N N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40m, pci bus id: 0000:0a:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K40m, pci bus id: 0000:0d:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K40m, pci bus id: 0000:0e:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:4) -> (device: 4, name: Tesla K40m, pci bus id: 0000:30:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:5) -> (device: 5, name: Tesla K40m, pci bus id: 0000:33:00.0)
INFO:tensorflow:Starting Session.
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global step 100: loss = 19.9046 (2.07 sec/step)
INFO:tensorflow:global step 200: loss = 19.8159 (2.64 sec/step)
INFO:tensorflow:global step 300: loss = 19.7198 (2.82 sec/step)
Training runs successfully.
3. Decreasing the batch_size in the first method
If the batch_size is decreased to 28, training runs successfully; however, another problem appears: the processing time per step becomes longer (3.48 sec/step), whereas the fine-tuning method takes 2.07 sec/step.
The command is:
python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=my_data \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_resnet_v2 \
--max_number_of_steps=500000 \
--batch_size=28 \
--num_readers=16 \
--learning_rate=0.1 \
--learning_rate_decay_type=exponential \
--num_epochs_per_decay=4.0 \
--learning_rate_decay_factor=0.9 \
--save_interval_secs=6000 \
--save_summaries_secs=1000 \
--log_every_n_steps=100 \
--optimizer=adam \
--opt_epsilon=1e-1 \
--weight_decay=0.0004 \
--num_clones=6
The running log:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 4
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 4 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 5 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3 4 5
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: Y Y Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 4: N N N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 5: N N N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40m, pci bus id: 0000:0a:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K40m, pci bus id: 0000:0d:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K40m, pci bus id: 0000:0e:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:4) -> (device: 4, name: Tesla K40m, pci bus id: 0000:30:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:5) -> (device: 5, name: Tesla K40m, pci bus id: 0000:33:00.0)
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 1902 get requests, put_count=1100 evicted_count=1000 eviction_rate=0.909091 and unsatisfied allocation rate=1
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Starting Session.
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.29GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.09GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.57GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
INFO:tensorflow:global step 100: loss = 26.2026 (3.49 sec/step)
INFO:tensorflow:global step 200: loss = 25.8227 (3.47 sec/step)
INFO:tensorflow:global_step/sec: 0.266924
INFO:tensorflow:global step 300: loss = 25.2874 (3.48 sec/step)
INFO:tensorflow:global step 400: loss = 24.7210 (3.47 sec/step)
INFO:tensorflow:global step 500: loss = 24.2435 (3.47 sec/step)
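For a rough throughput comparison (again assuming batch_size is the per-clone batch), the two successful runs work out to approximately:

echo "scale=1; 28 * 6 / 3.48" | bc   # ~48 images/sec, training from scratch at batch_size=28
echo "scale=1; 48 * 6 / 2.07" | bc   # ~139 images/sec, fine-tuning at batch_size=48

So besides the OOM itself, the scratch run at batch_size=28 also processes only about a third of the images per second of the fine-tuning run.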