
The validation loss is NaN, but the training loss is fine.

How can I fix this?

I've confirmed that there are no NaN values in the dataset.
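For reference, this is roughly how such a check can be done (a minimal sketch, assuming tri, vai, y_train, and y_val are the NumPy arrays passed to fit below; np.isfinite catches both NaN and inf):

import numpy as np

# Every input array should be entirely finite (no NaN, no inf)
for name, arr in [("tri", tri), ("y_train", y_train), ("vai", vai), ("y_val", y_val)]:
    print(name, "all finite:", np.isfinite(arr).all())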

import tensorflow as tf
from tensorflow import keras

# Pretrained ResNet-50 base without the classification head
base_model = keras.applications.resnet50.ResNet50(include_top=False, weights='imagenet')

# Freeze the convolutional base so only the new head is trained
for layer in base_model.layers:
    layer.trainable = False

avg = keras.layers.GlobalAveragePooling2D(name="global_avg")(base_model.output)
output = keras.layers.Dense(1, activation='sigmoid', name="predictions")(avg)
model = keras.Model(inputs=base_model.input, outputs=output, name="ResNet-50")

optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, decay=0.0001, clipnorm=0.1)
reduce_LROP = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10,
                                                verbose=0, mode='auto', min_delta=0.0001,
                                                cooldown=0, min_lr=0)
model.compile(loss=tf.keras.losses.BinaryCrossentropy(), optimizer=optimizer, metrics=['accuracy'])

history = model.fit(tri, y_train, epochs=10, batch_size=32, validation_data=(vai, y_val),
                    callbacks=[reduce_LROP])

[screenshot of the training output]

Try swapping the validation and training data and see what you get. https://stackoverflow.com/questions/40050397/deep-learning-nan-loss-reasons – dimay Oct 20 '20 at 06:29

1 Answer


I bought a GIGABYTE RTX 3080 Gaming OC 10GB for deep learning and used it to train a model.

I tested the same script in four environments:

  1. 3700X + RTX 3080 (CUDA 10.1)
  2. 3700X only (no GPU)
  3. A different laptop (i7-8750H + GTX 1050 Ti)
  4. 3700X + RTX 3080 (CUDA 11.0 + cuDNN 8.0.3)

The validation loss was fine in every environment except the first.

Using a TensorFlow nightly build with CUDA 11.0 solved the issue in my case.
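If you hit something similar, it helps to confirm which CUDA/cuDNN versions your TensorFlow binary was actually built against (a quick sketch, assuming TF 2.3 or newer, where tf.sysconfig.get_build_info() is available; the dictionary keys can vary between builds):

import tensorflow as tf

print("TF version:", tf.__version__)

# CUDA/cuDNN versions this TF binary was compiled against (GPU builds, TF >= 2.3)
build = tf.sysconfig.get_build_info()
print("CUDA:", build.get("cuda_version"), "cuDNN:", build.get("cudnn_version"))

# Confirm the GPU is actually visible to TensorFlow
print("GPUs:", tf.config.list_physical_devices("GPU"))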
