
I’m getting warnings like:

W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.62MiB. Current allocation summary follows.

and eventually the program crashes because it cannot find enough memory:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[64,160,400] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I’ve set both GPU options, allow_growth and per_process_gpu_memory_fraction, as follows:

import tensorflow as tf

# Cap the process at 40% of GPU memory and let the allocation grow on demand
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.4

# Pass the session config to the Estimator through RunConfig
estimator = tf.contrib.learn.Estimator(
        model_fn=model_fn,
        model_dir=MODEL_DIR,
        config=tf.contrib.learn.RunConfig(session_config=config))

However, neither option seems to have any effect.
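For comparison, here is a minimal sketch of the same GPU options passed through the core tf.estimator API instead of tf.contrib.learn (assuming TF 1.4+, where tf.estimator.RunConfig accepts session_config directly); I show it only to illustrate the equivalent configuration:

import tensorflow as tf

# Same GPU options, routed through the core tf.estimator API instead of contrib
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.4

run_config = tf.estimator.RunConfig(
        model_dir=MODEL_DIR,
        session_config=config)

estimator = tf.estimator.Estimator(
        model_fn=model_fn,
        config=run_config)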

Here are the logs:

INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x2aac9e3585c0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.4
  allow_growth: true
}
INFO:tensorflow:Graph was finalized.
2018-04-12 14:51:32.876271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: Tesla K40m major: 3 minor: 5 memoryClockRate(GHz): 0.745
pciBusID: 0000:86:00.0
totalMemory: 11.17GiB freeMemory: 11.09GiB
2018-04-12 14:51:32.876365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-12 14:51:33.690027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4575 MB memory) -> physical GPU (device: 0, name: Tesla K40m, pci bus id: 0000:86:00.0, compute capability: 3.5)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2018-04-12 15:03:36.187286: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.62MiB.  Current allocation summary follows.

Has anyone faced this issue before?

Note that this happens right when the model is initialized, so the allow_growth and per_process_gpu_memory_fraction options should have prevented it, but they don’t seem to take effect.
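As a sanity check, the same config applied to a plain tf.Session outside the Estimator (a quick sketch, assuming TF 1.x) should reserve only about 40% of the card and grow on demand, for example:

import tensorflow as tf

# Apply the same GPU options to a bare session to confirm they take effect at all
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.4

with tf.Session(config=config) as sess:
    # Running a trivial op initializes the device, so nvidia-smi shows the actual reservation
    print(sess.run(tf.constant(1.0)))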

Any pointers or tips to fix the problem would be helpful.

I've looked at similar questions and issues on GitHub, but none of them helped:
oom-when-allocating-tensor,
Tensorflow doesn't allocate full GPU memory,
Setting session from TrainConfig doesn't seem to work

