
I'm trying to train a neural net on a GPU using Keras and am getting a "Resource exhausted: OOM when allocating tensor" error. The specific tensor it's trying to allocate isn't very big, so I assume some previous tensor consumed almost all the VRAM. The error message comes with a hint that suggests this:

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

That sounds good, but how do I do it? RunOptions appears to be a TensorFlow thing, and what little documentation I can find for it associates it with a "session". I'm using Keras, so TensorFlow is hidden under a layer of abstraction and its sessions under another layer below that.

How do I dig underneath everything to set this option in such a way that it will take effect?

dspeyer

5 Answers


TF1 solution:

It's not as hard as it seems. What you need to know is that, according to the documentation, any extra keyword arguments (**kwargs) passed to model.compile are forwarded to session.run.

So you can do something like:

import tensorflow as tf

# Report which tensors are allocated when an OOM happens
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

model.compile(loss="...", optimizer="...", metrics=["..."], options=run_opts)

These options will then be forwarded on every session.run call during training.
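
For completeness, here is a minimal end-to-end sketch of how that looks in practice. The tiny model, the random data, and the hyperparameters below are placeholders of mine, assuming TensorFlow 1.x with standalone Keras on the TF backend:

import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense

# Ask TensorFlow to report which tensors are allocated if an OOM happens
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

model = Sequential([Dense(10, activation="softmax", input_shape=(100,))])
model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"],
              options=run_opts)  # forwarded to session.run by the TF backend

# Dummy data just to trigger some session.run calls
x = np.random.rand(256, 100)
y = np.eye(10)[np.random.randint(0, 10, size=256)]
model.fit(x, y, batch_size=32, epochs=1)

If an OOM occurs during fit, the error message should then include the per-tensor allocation report.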

TF2:

The solution above works only for TF1. For TF2, unfortunately, there appears to be no easy solution yet.

Manuel Popp
Dr. Snoopy

Currently, it is not possible to add the options to model.compile. See: https://github.com/tensorflow/tensorflow/issues/19911

Richard
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Enea Dume Aug 15 '18 at 15:03

OOM means out of memory. It may be that your model needs more memory than the GPU has at that point. Decrease batch_size significantly; I set it to 16 and then it worked fine.
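
For illustration, a minimal sketch of what that looks like (the model and data names are placeholders, and 16 is just a starting point; pick the largest batch size your GPU can actually hold):

model.fit(x_train, y_train, epochs=10, batch_size=16)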

naam
    Whether that will work, and what batch size is appropriate, will depend entirely on the model in question, as well as the dataset. If one is attempting to debug a memory issue that doesn't depend on batch size, this doesn't help at all. – Adam Azarchs Feb 24 '21 at 23:37

I got the same error, but only when the training dataset was roughly the same size as my GPU memory. For example, with 4 GB of video card memory I can train a model on a dataset of about 3.5 GB. The workaround for me was to write a custom data_generator function using yield, indices, and a lookback window. The other suggestion I received was to drop down to the TensorFlow framework itself and work with tf.Session directly (example).
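
As an illustration of that idea, here is a rough sketch of such a generator. The array names, the lookback logic, and the batch size are my own assumptions, not the original poster's code:

import numpy as np

def data_generator(features, targets, batch_size=32, lookback=10):
    """Yield (inputs, labels) batches so the whole dataset never has to sit in GPU memory."""
    indices = np.arange(lookback, len(features))
    while True:  # Keras generators are expected to loop forever
        np.random.shuffle(indices)
        for start in range(0, len(indices) - batch_size + 1, batch_size):
            batch_idx = indices[start:start + batch_size]
            # Each sample consists of the `lookback` rows preceding index i
            x = np.stack([features[i - lookback:i] for i in batch_idx])
            y = targets[batch_idx]
            yield x, y

# Usage (Keras 2.x / TF 1.x style):
# model.fit_generator(data_generator(features, targets),
#                     steps_per_epoch=len(features) // 32, epochs=10)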

ouflak

OOM is nothing but "OUT OF MEMORY".

TensorFlow throws this error when it runs out of VRAM while loading batches into memory.

I was trying to train a Vision Transformer on the CIFAR-100 dataset.

GPU: GTX 1650 w/ 4 GB VRAM

Initially, I had the batch_size set to 256, which was totally insane for such a GPU, and I was getting the same OOM error.

After I tweaked it to batch_size = 16 (or something lower that your GPU can handle), training worked perfectly fine.

So, always choose a smaller batch_size if you are training on a laptop or a mid-range GPU.
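
If you do not want to guess, a crude way to find a workable batch size is to catch the OOM error and retry with a smaller one. This is only a sketch of mine (the model, data, and starting size are placeholders), not something from the answer above:

import tensorflow as tf

batch_size = 256
while batch_size >= 1:
    try:
        model.fit(x_train, y_train, epochs=1, batch_size=batch_size)
        print("Batch size", batch_size, "fits in memory")
        break
    except tf.errors.ResourceExhaustedError:
        # OOM: halve the batch size and try again
        batch_size //= 2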

Sunderam Dubey
    This is generally good advice, but does not answer the OP's actual question. Sometimes reducing the batch size has adverse effects (e.g. contrastive learning), so other approaches to reducing memory usage are perhaps desirable. – starbeamrainbowlabs Sep 02 '22 at 16:20