I am running the resenet model on EC2 g2(NVIDIA GRID K520
) instance and seeing a OOM
Error. I have tried various combinations of removing the code that uses a GPU
, prefixing CUDA_VISIBLE_DEVICES='0'
and also reducing the batch_size
to 64. I am still failing to start the training. Can you help?
W tensorflow/core/common_runtime/bfc_allocator.cc:270] **********************x***************************************************************************xx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 196.00MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[64,16,224,224]
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[64,16,224,224]
[[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
[[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Traceback (most recent call last):
File "./resnet_main.py", line 203, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "./resnet_main.py", line 197, in main
train(hps)
File "./resnet_main.py", line 82, in train
feed_dict={model.lrn_rate: lrn_rate})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[64,16,224,224]
[[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
[[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'unit_1_2/sub1/conv1/Conv2D', defined at:
File "./resnet_main.py", line 203, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "./resnet_main.py", line 197, in main
train(hps)
File "./resnet_main.py", line 64, in train
model.build_graph()
File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 59, in build_graph
self._build_model()
File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 94, in _build_model
x = res_func(x, filters[1], filters[1], self._stride_arr(1), False)
File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 208, in _residual
x = self._conv('conv1', x, 3, in_filter, out_filter, stride)
File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 279, in _conv
return tf.nn.conv2d(x, kernel, strides, padding='SAME')
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 394, in conv2d
data_format=data_format, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
self._traceback = _extract_stack()