
I'm trying to train a neural net on a GPU using Keras and am getting a "Resource exhausted: OOM when allocating tensor" error. The specific tensor it's trying to allocate isn't very big, so I assume some previous tensor consumed almost all the VRAM. The error message comes with a hint that suggests this:

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

That sounds good, but how do I do it? RunOptions appears to be a TensorFlow thing, and what little documentation I can find for it associates it with a "session". I'm using Keras, so TensorFlow is hidden under a layer of abstraction and its sessions under another layer below that.

How do I dig underneath everything to set this option in such a way that it will take effect?

dspeyer

5 Answers


TF1 solution:

It's not as hard as it seems. What you need to know is that, according to the documentation, any extra keyword arguments (**kwargs) passed to model.compile are forwarded to session.run.

So you can do something like:

import tensorflow as tf

# Report which tensors are allocated when an OOM happens
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

model.compile(loss="...", optimizer="...", metrics=["..."], options=run_opts)

These options will then be forwarded on every session.run call during training.
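
For completeness, here is a minimal end-to-end sketch of how that looks in practice. The tiny model, the random data, and the hyperparameters below are placeholders of mine, assuming TensorFlow 1.x with standalone Keras on the TF backend:

import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense

# Ask TensorFlow to report which tensors are allocated if an OOM happens
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

model = Sequential([Dense(10, activation="softmax", input_shape=(100,))])
model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"],
              options=run_opts)  # forwarded to session.run by the TF backend

# Dummy data just to trigger some session.run calls
x = np.random.rand(256, 100)
y = np.eye(10)[np.random.randint(0, 10, size=256)]
model.fit(x, y, batch_size=32, epochs=1)

If an OOM occurs during fit, the error message should then include the per-tensor allocation report.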

TF2:

The solution above works only for TF1. For TF2, unfortunately, there appears to be no easy solution yet.

Manuel Popp
Dr. Snoopy

Currently, it is not possible to add the options to model.compile. See: https://github.com/tensorflow/tensorflow/issues/19911

Richard
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Enea Dume Aug 15 '18 at 15:03

OOM means out of memory. It may be that your model needs more memory than the GPU has at that point. Decrease batch_size significantly; I set it to 16 and then it worked fine.
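
For illustration, a minimal sketch of what that looks like (the model and data names are placeholders, and 16 is just a starting point; pick the largest batch size your GPU can actually hold):

model.fit(x_train, y_train, epochs=10, batch_size=16)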

naam
    Whether that will work, and what batch size is appropriate, will depend entirely on the model in question, as well as the dataset. If one is attempting to debug a memory issue that doesn't depend on batch size, this doesn't help at all. – Adam Azarchs Feb 24 '21 at 23:37

I got the same error, but only when the training dataset was roughly the same size as my GPU memory. For example, with 4 GB of video card memory I can train a model on a dataset of about 3.5 GB. The workaround for me was to write a custom data_generator function using yield, indices, and a lookback window. The other suggestion I received was to drop down to the TensorFlow framework itself and work with tf.Session directly (example).
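
As an illustration of that idea, here is a rough sketch of such a generator. The array names, the lookback logic, and the batch size are my own assumptions, not the original poster's code:

import numpy as np

def data_generator(features, targets, batch_size=32, lookback=10):
    """Yield (inputs, labels) batches so the whole dataset never has to sit in GPU memory."""
    indices = np.arange(lookback, len(features))
    while True:  # Keras generators are expected to loop forever
        np.random.shuffle(indices)
        for start in range(0, len(indices) - batch_size + 1, batch_size):
            batch_idx = indices[start:start + batch_size]
            # Each sample consists of the `lookback` rows preceding index i
            x = np.stack([features[i - lookback:i] for i in batch_idx])
            y = targets[batch_idx]
            yield x, y

# Usage (Keras 2.x / TF 1.x style):
# model.fit_generator(data_generator(features, targets),
#                     steps_per_epoch=len(features) // 32, epochs=10)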

ouflak

OOM is nothing but "OUT OF MEMORY".

TensorFlow throws this error when it runs out of VRAM while loading batches into memory.

I was trying to train a Vision Transformer on the CIFAR-100 dataset.

GPU: GTX 1650 w/ 4 GB VRAM

Initially, I had the batch_size set to 256, which was totally insane for such a GPU, and I was getting the same OOM error.

After I tweaked it to batch_size = 16 (or something lower that your GPU can handle), training worked perfectly fine.

So, always choose a smaller batch_size if you are training on a laptop or a mid-range GPU.
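
If you do not want to guess, a crude way to find a workable batch size is to catch the OOM error and retry with a smaller one. This is only a sketch of mine (the model, data, and starting size are placeholders), not something from the answer above:

import tensorflow as tf

batch_size = 256
while batch_size >= 1:
    try:
        model.fit(x_train, y_train, epochs=1, batch_size=batch_size)
        print("Batch size", batch_size, "fits in memory")
        break
    except tf.errors.ResourceExhaustedError:
        # OOM: halve the batch size and try again
        batch_size //= 2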

Sunderam Dubey
    This is generally good advice, but does not answer the OP's actual question. Sometimes reducing the batch size has adverse effects (e.g. contrastive learning), so other approaches to reducing memory usage are perhaps desirable. – starbeamrainbowlabs Sep 02 '22 at 16:20