
The validation loss is NaN, but the training loss is fine.

How can I fix this?

I've confirmed that there are no NaN values in the dataset.
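For reference, this is roughly how such a check can be done (a minimal sketch, assuming tri, vai, y_train, and y_val are the NumPy arrays passed to fit below; np.isfinite catches both NaN and inf):

import numpy as np

# Every input array should be entirely finite (no NaN, no inf)
for name, arr in [("tri", tri), ("y_train", y_train), ("vai", vai), ("y_val", y_val)]:
    print(name, "all finite:", np.isfinite(arr).all())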

import tensorflow as tf
from tensorflow import keras

# Pretrained ResNet-50 base without the classification head
base_model = keras.applications.resnet50.ResNet50(include_top=False, weights='imagenet')

# Freeze the convolutional base so only the new head is trained
for layer in base_model.layers:
    layer.trainable = False

avg = keras.layers.GlobalAveragePooling2D(name="global_avg")(base_model.output)
output = keras.layers.Dense(1, activation='sigmoid', name="predictions")(avg)
model = keras.Model(inputs=base_model.input, outputs=output, name="ResNet-50")

optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, decay=0.0001, clipnorm=0.1)
reduce_LROP = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10,
                                                verbose=0, mode='auto', min_delta=0.0001,
                                                cooldown=0, min_lr=0)
model.compile(loss=tf.keras.losses.BinaryCrossentropy(), optimizer=optimizer, metrics=['accuracy'])

history = model.fit(tri, y_train, epochs=10, batch_size=32, validation_data=(vai, y_val),
                    callbacks=[reduce_LROP])

[screenshot of the training output]

Try swapping the validation and training data and see what you get. https://stackoverflow.com/questions/40050397/deep-learning-nan-loss-reasons – dimay Oct 20 '20 at 06:29

1 Answer


I bought a GIGABYTE RTX 3080 Gaming OC 10GB for deep learning and used it to train a model.

I tested the same script in four environments:

  1. 3700X + RTX 3080 (CUDA 10.1)
  2. 3700X only (no GPU)
  3. A different laptop (i7-8750H + GTX 1050 Ti)
  4. 3700X + RTX 3080 (CUDA 11.0 + cuDNN 8.0.3)

The validation loss was fine in every environment except the first.

Using a TensorFlow nightly build with CUDA 11.0 solved the issue in my case.
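If you hit something similar, it helps to confirm which CUDA/cuDNN versions your TensorFlow binary was actually built against (a quick sketch, assuming TF 2.3 or newer, where tf.sysconfig.get_build_info() is available; the dictionary keys can vary between builds):

import tensorflow as tf

print("TF version:", tf.__version__)

# CUDA/cuDNN versions this TF binary was compiled against (GPU builds, TF >= 2.3)
build = tf.sysconfig.get_build_info()
print("CUDA:", build.get("cuda_version"), "cuDNN:", build.get("cudnn_version"))

# Confirm the GPU is actually visible to TensorFlow
print("GPUs:", tf.config.list_physical_devices("GPU"))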
