
I want to build a model with multiple inputs, so I tried something like this:

# imports assumed for this snippet (TF 1.x-era tf.keras, matching the traceback below)
from tensorflow.keras.layers import Input, Dense, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import layers

# define two sets of inputs
inputA = Input(shape=(32, 64, 1))
inputB = Input(shape=(32, 1024))

# CNN branch
x = layers.Conv2D(32, kernel_size=(3, 3), activation='relu')(inputA)
x = layers.Conv2D(32, (3, 3), activation='relu')(x)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
x = layers.Dropout(0.2)(x)
x = layers.Flatten()(x)
x = layers.Dense(500, activation='relu')(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(500, activation='relu')(x)
x = Model(inputs=inputA, outputs=x)

# DNN branch
y = layers.Flatten()(inputB)
y = Dense(64, activation="relu")(y)
y = Dense(250, activation="relu")(y)
y = Dense(500, activation="relu")(y)
y = Model(inputs=inputB, outputs=y)

# combine the outputs of the two branches
combined = concatenate([x.output, y.output])

# head on the combined features; each layer must feed on the previous one,
# not on `combined` three times
z = Dense(300, activation="relu")(combined)
z = Dense(100, activation="relu")(z)
# note: a single-unit softmax always outputs 1.0; with
# sparse_categorical_crossentropy this is usually Dense(num_classes, 'softmax')
z = Dense(1, activation="softmax")(z)

model = Model(inputs=[x.input, y.input], outputs=z)

model.summary()

opt = Adam(lr=1e-3, decay=1e-3 / 200)
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

and here is the summary: _

But when I try to train this model,

history = model.fit([trainimage, train_product_embd],train_label,
    validation_data=([validimage,valid_product_embd],valid_label), epochs=10, 
    steps_per_epoch=100, validation_steps=10)

this error occurs:

 ResourceExhaustedError                    Traceback (most recent call last)
 <ipython-input-18-2b79f16d63c0> in <module>()
 ----> 1 history = model.fit([trainimage, train_product_embd],train_label,
             validation_data=([validimage,valid_product_embd],valid_label),
             epochs=10, steps_per_epoch=100, validation_steps=10)

 4 frames
 /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py in __call__(self, *args, **kwargs)
    1470     ret = tf_session.TF_SessionRunCallable(self._session._session,
    1471                                            self._handle, args,
 -> 1472                                            run_metadata_ptr)
    1473     if run_metadata:
    1474         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

 ResourceExhaustedError: 2 root error(s) found.
 (0) Resource exhausted: OOM when allocating tensor with shape[800000,32,30,62] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node conv2d_1/convolution}}]]
 Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[metrics/acc/Mean_1/_185]]
 Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 (1) Resource exhausted: OOM when allocating tensor with shape[800000,32,30,62] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node conv2d_1/convolution}}]]
 Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 0 successful operations. 0 derived errors ignored.

Thanks for reading, and I hope you can help me :)

Robert

5 Answers


OOM stands for "out of memory". Your GPU is running out of memory, so it can't allocate memory for this tensor. There are a few things you can do:

  • Decrease the number of units in your Dense layers and filters in your Conv2D layers
  • Use a smaller batch_size (or increase steps_per_epoch and validation_steps)
  • Use grayscale images (you can use tf.image.rgb_to_grayscale)
  • Reduce the number of layers
  • Use MaxPooling2D layers after convolutional layers
  • Reduce the size of your images (you can use tf.image.resize for that)
  • Use smaller float precision for your input, namely np.float32 instead of np.float64 (this, grayscale conversion, and resizing are sketched after this list)
  • If you're using a pre-trained model, freeze the first layers (like this)
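
A minimal sketch of the grayscale, resize, and precision ideas, assuming a hypothetical NumPy array of RGB images (your own arrays would go in its place):

import numpy as np
import tensorflow as tf

images = np.random.rand(1000, 64, 128, 3)   # hypothetical float64 RGB batch

x = tf.image.rgb_to_grayscale(images)       # 3 channels -> 1 channel
x = tf.image.resize(x, (32, 64))            # shrink the spatial dimensions
x = tf.cast(x, tf.float32)                  # float64 -> float32 halves the memory
print(x.shape, x.dtype)                     # (1000, 32, 64, 1) <dtype: 'float32'>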

There is more useful information about this error:

OOM when allocating tensor with shape[800000,32,30,62]

This is a weird shape. If you're working with images, you should normally have 3 or 1 channel. On top of that, it seems like you are passing your entire dataset at once; you should instead pass it in batches.
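
One way to feed batches instead of the whole arrays, sketched here with tf.data (an assumption on my part; any batching mechanism works, and the arrays are the ones from the question):

import tensorflow as tf

# pair the two inputs with the labels and batch them explicitly
train_ds = tf.data.Dataset.from_tensor_slices(
    ((trainimage, train_product_embd), train_label)).batch(32)

history = model.fit(train_ds, epochs=10)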

Nicolas Gervais
  • It's a bit weird that I'm getting the same error (after switching from the CPU version to the GPU version) just by building a Sequential model and adding the layers, not even calling fit or anything: shape[173056,4096] and type float. Is it normal to get allocation errors when you aren't even passing batches? – Mr-Programs Jun 02 '20 at 02:53
  • Yes, because a neural network is just a huge matrix of float values, just like input batches. – Nicolas Gervais Jun 02 '20 at 02:55
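
To put a number on that, the [173056,4096] matrix from the comment above already needs gigabytes before any batch is fed, as this back-of-the-envelope check shows:

params = 173056 * 4096            # ~7.1e8 weights in that single matrix
bytes_needed = params * 4         # float32 = 4 bytes per weight
print(bytes_needed / 1024 ** 3)   # ~2.64 GiB just for these weights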

From [800000,32,30,62] it seems your model is putting all the data into one batch.

Try specifying a batch size, like this:

history = model.fit([trainimage, train_product_embd], train_label,
    validation_data=([validimage, valid_product_embd], valid_label),
    epochs=10, steps_per_epoch=100, validation_steps=10, batch_size=32)

If it still OOMs, then try reducing the batch_size.

Natthaphon Hongcharoen
  • OP specified `steps_per_epoch=100` so I don't think that's the case. `batch_size` should be automatically set to `sample_size/steps_per_epoch`. – Nicolas Gervais Dec 18 '19 at 16:05
  • I tried `batch_size` first, but there was an error: "ValueError: If your data is in the form of symbolic tensors, you should specify the `steps_per_epoch` argument (instead of the `batch_size` argument, because symbolic tensors are expected to produce batches of input data)." Thank you for your opinion, by the way :) – Robert Dec 18 '19 at 16:16

This happened to me as well.

You can try reducing the number of trainable parameters with some form of transfer learning: freeze the first few layers and use a lower batch size, as in the sketch below.
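
A minimal sketch of that idea, assuming a stock tf.keras backbone such as VGG16 rather than the asker's from-scratch model:

import tensorflow as tf

base = tf.keras.applications.VGG16(include_top=False, input_shape=(64, 64, 3))

# freeze the first few layers so no gradients or optimizer state are kept for them
for layer in base.layers[:10]:
    layer.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')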


I think the most common reason for this error is the absence of MaxPooling layers. Use the same architecture, but add at least one MaxPool layer after the Conv2D layers; this might even improve the overall performance of the model. You can also try reducing the depth of the model, i.e., removing unnecessary layers and optimizing. A sketch follows.
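
For instance, here is the question's convolutional branch with a pooling layer after each Conv2D (the exact placement is my assumption), which shrinks the activations that match the 30x62 shape in the error:

from tensorflow.keras import Input, layers

inputA = Input(shape=(32, 64, 1))
x = layers.Conv2D(32, (3, 3), activation='relu')(inputA)  # -> (30, 62, 32)
x = layers.MaxPooling2D((2, 2))(x)                        # -> (15, 31, 32), 4x fewer activations
x = layers.Conv2D(32, (3, 3), activation='relu')(x)       # -> (13, 29, 32)
x = layers.MaxPooling2D((2, 2))(x)                        # -> (6, 14, 32)
x = layers.Flatten()(x)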


Solutions:

  • Reduce your input dimensions, because GPU RAM is limited (an Nvidia GTX 1060, for example, has only 3 GB).
  • Reduce the batch_size of datagen.flow (it defaults to 32, so try 8, 16, or 24), as in the sketch below.
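
A minimal sketch of the second point, assuming Keras's ImageDataGenerator (which the datagen.flow above refers to) and the arrays from the question:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)

# flow() defaults to batch_size=32; drop it to 8/16/24 if the GPU runs out of memory
train_gen = datagen.flow(trainimage, train_label, batch_size=16)
model.fit(train_gen, epochs=10)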