
Here is my classification problem:

  • Classify pathological images into 2 classes: "Cancer" and "Normal"
  • The data sets contain 150,000 and 300,000 images respectively
  • All images are 512x512 RGB .jpg images
  • The total is about 32 GB

Here is my configuration :

  • CPU : Intel i7
  • GPU : Nvidia Geforce RTX 3060 (6 GB)
  • Python 3.7
  • Jupyter notebook 6.4.8
  • Tensorflow 2.6 (GPU support installed as described here: https://www.tensorflow.org/install/gpu)

And here is the simple CNN I wanted to try first:

import tensorflow as tf

num_classes = 2  # "Cancer" and "Normal"

model = tf.keras.Sequential([
  tf.keras.layers.Rescaling(1./255),
  tf.keras.layers.Conv2D(16, 4, activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Conv2D(32, 4, activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Conv2D(64, 4, activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(num_classes)
])

Unfortunately, it raised several kinds of errors during or at the end of the first epoch: an out-of-memory error, a "The kernel appears to have died. It will restart automatically" crash like the one reported in How to fix 'The kernel appears to have died. It will restart automatically" caused by pytorch, or even a black screen with no control anymore. I assumed my GPU was running out of memory, so I tried several changes according to this post: How to fix "ResourceExhaustedError: OOM when allocating tensor" (notably decreasing the batch size, downsizing the images, and switching them from RGB to grayscale). Nevertheless, I still get the issues described above...
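For reference, one more memory-related knob I came across (these are standard tf.config calls, though I am not sure it addresses the crash): telling TensorFlow to allocate GPU memory on demand instead of reserving all 6 GB up front.

```python
import tensorflow as tf

# Grow GPU memory usage on demand instead of reserving it all at
# startup; this must run before any GPU operation is executed.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```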

So here are my questions:

  1. Do you think it is still possible to address such a problem with my Nvidia RTX 3060 GPU?
  2. If yes, do you have any tips that I may be missing?

Bonus question: I used to work on another CNN with 40,000 images in the data sets (256x256 grayscale images). That CNN was deeper (4 layers with more filters) and the GPU had less memory (an Nvidia Quadro P600). Nevertheless, I never faced any memory issues. That's why I am really wondering what is using the GPU memory: storing the images? the neuron weights? something else that I am missing?

chalbiophysics
  • Have you seen how many parameters are in your model? 512x512 inputs are also pretty large, ImageNet models use 224x224 in comparison. – Dr. Snoopy May 13 '22 at 16:52
  • Hi @Dr.Snoopy, I tried to downsize images to 256x256, 128x128 and down to 64x64. But it never worked... Do you suggest that the original images should be 224x224 and not downsized afterward? – chalbiophysics May 13 '22 at 17:10
  • Again, how many parameters are there? And what is the actual error message? Including code will help; the error message tells you how much RAM it is trying to allocate, and that tells you how far off you are. – Dr. Snoopy May 13 '22 at 17:19
  • With images downsized to 256x256, 'rgb' and batch size = 16, there are 6,931,698 parameters. The error obtained is like the one reported here: https://stackoverflow.com/questions/56759112/how-to-fix-the-kernel-appears-to-have-died-it-will-restart-automatically-caus – chalbiophysics May 13 '22 at 17:28
  • That is too generic; you said there was an out-of-memory error, so include it as text. And by giving information in little drops, people will lose interest in answering your question. – Dr. Snoopy May 13 '22 at 17:33
  • I cannot reproduce the out-of-memory error anymore, but it said that it was a problem of memory allocation... – chalbiophysics May 13 '22 at 17:55
  • Sanity check: are you using a data generator? – akilat90 May 13 '22 at 19:45
  • Could you please share the rest of your code? At this point we are just speculating... – eschibli May 13 '22 at 21:21

1 Answer


Generally, GPU memory issues aren't caused by a large training dataset; they are caused by too large a network combined with too large a batch size.

Back-of-napkin math: at 512x512 inputs your first dense layer alone has about thirty million weights, while the three conv layers contribute only about 42,000 in total; downsized to 128x128, the dense layer drops to roughly a million weights. By comparison, MobileNet has about four million weights and is designed to run inference on mobile devices, so at the downsized resolutions, unless you are using a huge batch size, you shouldn't have GPU memory issues.
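That napkin math can be checked with a quick count (kernel size 4, 'valid' convolutions, stride-2 max pooling; num_classes assumed to be 2, matching the two classes in the question):

```python
# Parameter count for the CNN in the question as a function of input size.
def param_count(side, channels=3, num_classes=2):
    params = 0
    filters_in = channels
    for filters_out in (16, 32, 64):
        # Conv2D(filters_out, 4): 4*4 kernel plus one bias per filter
        params += 4 * 4 * filters_in * filters_out + filters_out
        # 'valid' conv shrinks side by 3, then 2x2 max pool halves it
        side = (side - 3) // 2
        filters_in = filters_out
    flat = side * side * 64                      # Flatten
    params += flat * 128 + 128                   # Dense(128)
    params += 128 * num_classes + num_classes    # Dense(num_classes)
    return params

for side in (512, 256, 128):
    print(side, param_count(side))
# 256x256 RGB gives 6,931,698, matching the count reported in the comments.
```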

Are you using the tf.data API, or trying to load the entire dataset into your RAM? Using tf.data is the best practice for large datasets, as it allows you to load data and perform data augmentations just-in-time.
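Here is a minimal runnable sketch of that streaming approach; a few tiny synthetic JPEGs stand in for the real data, and the directory names, image size, and batch size are assumptions:

```python
import os
import tempfile

import tensorflow as tf

# Build a tiny stand-in dataset on disk (two classes, four JPEGs each)
# so the streaming pipeline below runs end-to-end.
root = tempfile.mkdtemp()
for cls in ("Cancer", "Normal"):
    os.makedirs(os.path.join(root, cls))
    for i in range(4):
        img = tf.random.uniform((512, 512, 3), maxval=255, dtype=tf.int32)
        jpg = tf.io.encode_jpeg(tf.cast(img, tf.uint8))
        tf.io.write_file(os.path.join(root, cls, f"{i}.jpg"), jpg)

# image_dataset_from_directory returns a tf.data.Dataset that streams
# batches from disk instead of loading all 450,000 images into RAM.
train_ds = tf.keras.utils.image_dataset_from_directory(
    root,
    image_size=(128, 128),  # downsize on load to keep activations small
    batch_size=4,
)
# Prefetch so the CPU decodes JPEGs while the GPU trains on the previous batch.
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)

images, labels = next(iter(train_ds))
print(images.shape)  # (4, 128, 128, 3)
```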

Edit: Also, since you are performing classification, you should probably either add a Softmax activation to your last layer or use a loss with from_logits=True.
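A minimal sketch of the two usual ways to wire that up (standalone example, not the asker's actual training code):

```python
import tensorflow as tf

num_classes = 2

# Option A: keep the linear Dense(num_classes) head and tell the loss
# that it receives raw logits.
logits_head = tf.keras.layers.Dense(num_classes)
loss_a = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Option B: put the softmax inside the model so predict() yields probabilities.
prob_head = tf.keras.layers.Dense(num_classes, activation="softmax")
loss_b = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

# With the softmax inside, each row of the output sums to 1.
x = tf.random.normal((3, 128))
probs = prob_head(x)
print(tf.reduce_sum(probs, axis=1))  # each entry is ~1.0
```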

eschibli
  • Hi @eschibli. Thanks for your answer. I load data with tf.keras.utils.image_dataset_from_directory. So do you suggest I should try tf.data instead? Actually, for the previous CNN that I described at the end of my question, I loaded the data from a pickle file. Do you think that could explain this error? – chalbiophysics May 13 '22 at 17:14