
I'm running into an OOM error on a multi-GPU machine because TF 2.3 seems to be allocating a tensor on only one GPU.

tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_ops.cc:539 : 
Resource exhausted: OOM when allocating tensor with shape[20532,64,48,32] 
and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc.

But TensorFlow does recognize multiple GPUs when I run my code:

Adding visible gpu devices: 0, 1, 2

Is there anything else I need to do to have TF use all GPUs?

user2212461
  • A tensor with that shape needs about 8 GB of memory, if I'm not wrong. Do you really have enough memory on the machine? – Felix Jun 28 '21 at 15:40
  • Each GPU on my machine has 16160 MiB, so it should be enough? – user2212461 Jun 28 '21 at 15:50
  • Should be enough. I haven't really worked with TensorFlow myself, just stumbled across your question. Can you check whether TF actually sees/has access to the full GPU memory? Something like what's answered in https://stackoverflow.com/questions/36123740/is-there-a-way-of-determining-how-much-gpu-memory-is-in-use-by-tensorflow – Felix Jun 28 '21 at 16:39
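
As a quick sanity check on that estimate (back-of-envelope arithmetic, not from the thread), a float32 tensor of that shape does come to roughly 8 GB:

elements = 20532 * 64 * 48 * 32   # ~2.02 billion elements
bytes_needed = elements * 4       # float32 is 4 bytes per element
print(bytes_needed / 1024**3)     # ~7.5 GiB, i.e. roughly 8 GB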

1 Answer


The direct answer is yes, you do need to do more to get TF to use multiple GPUs. You should refer to this guide, but the TL;DR is:

import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()

with mirrored_strategy.scope():
  ...  # build and compile your model here so its variables are mirrored across GPUs

https://www.tensorflow.org/guide/distributed_training#using_tfdistributestrategy_with_tfkerasmodelfit
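
For context, a minimal sketch of what typically goes inside that scope (the model, shapes, and data below are placeholders, not the asker's code). Variables created inside the scope, i.e. the model's weights and optimizer state, are mirrored across all visible GPUs, and the global batch passed to fit() is split across them.

import numpy as np
import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()

with mirrored_strategy.scope():
  # Variables created here (weights, optimizer slots) are replicated on every GPU.
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(64, 3, activation='relu', input_shape=(64, 128, 3)),
      tf.keras.layers.GlobalAveragePooling2D(),
      tf.keras.layers.Dense(10),
  ])
  model.compile(
      optimizer='adam',
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Dummy data just to make the sketch runnable; the global batch of 64
# is split across the available GPUs.
x = np.random.rand(256, 64, 128, 3).astype('float32')
y = np.random.randint(0, 10, size=(256,))
model.fit(x, y, batch_size=64, epochs=1)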

But in your case, something else is going on. While this one tensor may be the allocation that triggers the OOM, it is likely failing because several previous large tensors have already filled the GPU's memory.

The first dimension, your batch size, is 20532, which is really big. Since it factors as 2² × 3 × 29 × 59, I'm going to guess you are working with CHW format and your source image was 3x64x128, which got trimmed after a few convolutions (for example, 3 × 58 × 118 = 20532, roughly what a 64x128 image becomes after a few unpadded convolutions). I'd suspect an inadvertent broadcast. Print model.summary() and review the sizes of the tensors coming out of each layer, as in the sketch below. You may also need to look at your batching.
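
A quick way to do that check, assuming a tf.keras model object named model (the names here are illustrative):

model.summary()  # prints each layer's output shape and parameter count

# Or walk the layers programmatically to spot where a dimension blows up:
for layer in model.layers:
  print(layer.name, layer.output_shape)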

Yaoshiang