
When you train a neural network you need to set a batch size, and the higher the batch size, the higher the GPU memory consumption. When the GPU runs out of memory, TensorFlow raises messages like these:

2021-03-29 15:45:04.185417: E tensorflow/stream_executor/cuda/cuda_driver.cc:825] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-29 15:45:04.229570: E tensorflow/stream_executor/cuda/cuda_driver.cc:825] failed to alloc 7730940928 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-29 15:45:10.776120: E tensorflow/stream_executor/cuda/cuda_driver.cc:825] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
...

The solution is to decrease the batch size. I would like to be able to catch this error when the message appears, so that I can send a message to the view, or even automatically decrease the batch size in order to automate training. In my case, the out-of-memory error comes from loading the dataset:

try:
  features, labels = iter(input_dataset).next()
except:
  print("this is my exception") 
  raise

but the CUDA OOM error does not seem to be catchable this way. I think the error is already caught inside the next() function of the tf.data.Dataset iterator, and what I see is actually the log produced by that internal catch of the OOM error. I don't know how to detect this log in order to react to the OOM event.

DeepProblems
  • Does this answer your question? https://stackoverflow.com/questions/64900712/how-to-write-a-contextmanager-to-throw-and-catch-errors – rok Mar 29 '21 at 14:29
  • @rok I tried something like that, but it does not work. I think it is because the error is already caught and logged by the tf.Dataset class. I modified my question accordingly to what I found. – DeepProblems Mar 30 '21 at 13:19
  • you can catch the resource error like: `except tf.errors.ResourceExhaustedError as e:` – Bijay Regmi Mar 30 '21 at 13:37
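
A minimal sketch of what that last comment suggests, assuming a TF 2.x tf.data pipeline and the `input_dataset` from the question; note it only helps if the OOM actually propagates as a Python exception rather than being logged internally:

import tensorflow as tf

# input_dataset stands in for the question's tf.data.Dataset
iterator = iter(input_dataset)
try:
    features, labels = next(iterator)
except tf.errors.ResourceExhaustedError as e:
    # only reached if the OOM propagates as a Python exception;
    # as described in the question, tf.data may just log it to stderr instead
    print("OOM while fetching a batch, decrease the batch size:", e)
    raise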

1 Answer


The next() method of the tf.compat.v1.Dataset iterator, which is called when you write:

iter(my_dataset).next()

already catches the OOM error. It logs the error to stderr and then simply tries to generate the next batch. You cannot catch the OOM error yourself because the TensorFlow API has already done so.

Nevertheless, you can track the error by reading stderr. In my case I was launching my training script from the command line like this:

process = subprocess.Popen('py -u train.py')

So I just had to change it to:

process = subprocess.Popen('py -u train.py', stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

in order to redirect stderr to stdout, and then parse stdout:

while True:
    output = process.stdout.readline()
    # readline() returns bytes because stdout is a binary pipe
    if output == b'' and process.poll() is not None:
        break
    if output:
        log_message = output.strip().decode('utf-8')
        if "CUDA_ERROR_OUT_OF_MEMORY" in log_message:
            process.kill()
            print("please decrease batch_size")
            break
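
Going one step further, here is a hedged sketch of a wrapper that automates the goal from the question: relaunching the script with a smaller batch size after an OOM. The --batch_size flag is an assumption about train.py's interface, not something it necessarily supports as-is:

import subprocess

batch_size = 256
while batch_size >= 1:
    # hypothetical flag; adapt to whatever train.py actually accepts
    process = subprocess.Popen(f'py -u train.py --batch_size {batch_size}',
                               stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    oom = False
    for line in process.stdout:               # bytes lines from the merged output
        log_message = line.strip().decode('utf-8')
        if "CUDA_ERROR_OUT_OF_MEMORY" in log_message:
            process.kill()
            oom = True
            break
    process.wait()
    if not oom:
        break                                 # training ended without an OOM message
    batch_size //= 2
    print(f"OOM detected, retrying with batch_size={batch_size}")
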
DeepProblems