When you train a neural network, you have to set a batch size. The higher the batch size, the higher the GPU memory consumption. When you run out of GPU memory, TensorFlow prints this kind of message:
2021-03-29 15:45:04.185417: E tensorflow/stream_executor/cuda/cuda_driver.cc:825] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-29 15:45:04.229570: E tensorflow/stream_executor/cuda/cuda_driver.cc:825] failed to alloc 7730940928 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-29 15:45:10.776120: E tensorflow/stream_executor/cuda/cuda_driver.cc:825] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
...
The solution is to decrease the batch size. I would like to be able to catch this error when the message appears, so that I can report it to the view, or even decrease the batch size automatically to make training self-adjusting. In my case, the out-of-memory error comes from loading the dataset:
try:
    # pull the first batch from the dataset
    features, labels = next(iter(input_dataset))
except:
    # never reached: the OOM is apparently not raised as a Python exception
    print("this is my exception")
    raise
But the CUDA OOM error does not seem to be catchable this way. I think the error is actually already caught inside the next() call of the tf.data iterator, and what I see is the log generated by that internal handling of the OOM. I don't know how to detect this log in order to react to the OOM event.
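For reference, the automatic fallback I have in mind would look roughly like the sketch below. It assumes the OOM would surface as tf.errors.ResourceExhaustedError (rather than only being logged), and build_dataset is a hypothetical helper that rebuilds my tf.data pipeline with a given batch size:

import tensorflow as tf

def load_first_batch(build_dataset, batch_size, min_batch_size=1):
    # Try to pull the first batch; on OOM, halve the batch size and retry.
    # build_dataset is a hypothetical helper: batch_size -> tf.data.Dataset.
    while batch_size >= min_batch_size:
        try:
            dataset = build_dataset(batch_size)
            features, labels = next(iter(dataset))
            return features, labels, batch_size
        except tf.errors.ResourceExhaustedError:
            print(f"OOM with batch size {batch_size}, retrying with a smaller one")
            batch_size //= 2
    raise RuntimeError("OOM even with the minimum batch size")

But since the OOM only shows up in the log and never reaches my except block, a loop like this never gets the chance to react.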