
I am running the ResNet model on an EC2 g2 instance (NVIDIA GRID K520) and seeing an OOM error. I have tried various combinations of removing the code that uses the GPU, prefixing the command with CUDA_VISIBLE_DEVICES='0', and reducing the batch_size to 64, but training still fails to start. Can you help?
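For reference, this is roughly how I set the environment variable (a sketch of my setup, not the exact launch command; note that '0' keeps GPU 0 visible, while an empty string would hide all GPUs):

    # Sketch: restrict TensorFlow to GPU 0. This must be set before
    # TensorFlow initializes CUDA, i.e. before "import tensorflow".
    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'

    import tensorflow as tf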

W tensorflow/core/common_runtime/bfc_allocator.cc:270] **********************x***************************************************************************xx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 196.00MiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[64,16,224,224]
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[64,16,224,224]
     [[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
     [[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Traceback (most recent call last):
  File "./resnet_main.py", line 203, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "./resnet_main.py", line 197, in main
    train(hps)
  File "./resnet_main.py", line 82, in train
    feed_dict={model.lrn_rate: lrn_rate})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[64,16,224,224]
     [[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
     [[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'unit_1_2/sub1/conv1/Conv2D', defined at:
  File "./resnet_main.py", line 203, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "./resnet_main.py", line 197, in main
    train(hps)
  File "./resnet_main.py", line 64, in train
    model.build_graph()
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 59, in build_graph
    self._build_model()
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 94, in _build_model
    x = res_func(x, filters[1], filters[1], self._stride_arr(1), False)
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 208, in _residual
    x = self._conv('conv1', x, 3, in_filter, out_filter, stride)
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 279, in _conv
    return tf.nn.conv2d(x, kernel, strides, padding='SAME')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 394, in conv2d
    data_format=data_format, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()
Naveen Swamy
  • 109
  • 8
  • Apparently it is still using the GPU, given use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0". Could you run with a batch_size of 1? It might be that the model is simply too large and too memory-consuming. Could you look up how much memory this GPU has? You can also set the flag num_gpus to 0 to run on the CPU. – Yao Zhang Oct 05 '16 at 00:14

1 Answer


The NVIDIA GRID K520 has 8GB of memory (link). I have successfully trained ResNet models on an NVIDIA GPU with 12GB of memory. As the error suggests, TensorFlow tries to fit all of the network's weights, plus the activations needed for training, into GPU memory and fails. I believe you have a few options:

  • Train only on the CPU, as mentioned in the comments, assuming your machine has more than 8GB of RAM (see the sketch after this list). This will be slow and is not recommended.
  • Train a different network with fewer parameters. Several networks released since ResNet, such as Inception-v4 and Inception-ResNet, have fewer parameters and comparable accuracy. This option costs nothing to try!
  • Buy a GPU with more memory. This is the easiest option if you have the money.
  • Buy another GPU with the same amount of memory and train the bottom half of the network on one and the top half on the other. The difficulty of communicating between the GPUs makes this option less desirable.
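
A minimal sketch of the CPU-only route from the first option (assuming a stock tf.Session; device_count is a standard ConfigProto field):

    import tensorflow as tf

    # Hide all GPUs from this session so every op is placed on the CPU.
    # This has the same effect as running with num_gpus set to 0 or with
    # CUDA_VISIBLE_DEVICES set to an empty string.
    config = tf.ConfigProto(device_count={'GPU': 0})
    sess = tf.Session(config=config)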

I hope this helps you and others who run into similar memory issues.
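
As a quick sanity check on the numbers (my own arithmetic, based only on the shape in the error message), a single float32 tensor of shape [64, 16, 224, 224] accounts for exactly the 196 MiB the allocator failed to find, so memory pressure scales linearly with batch size:

    # One float32 tensor of shape [64, 16, 224, 224], 4 bytes per element:
    num_elements = 64 * 16 * 224 * 224
    size_mib = num_elements * 4 / float(1 << 20)
    print(size_mib)  # 196.0, matching "Ran out of memory trying to allocate 196.00MiB"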

Dan Salo