
I tried to train a UNet on GPU to produce a binary-classified image, but got NaN loss on every epoch. Testing the loss function on its own always returns NaN.

Test case:

import tensorflow as tf
import tensorflow.keras.losses as ls

true = [0.0, 1.0]
pred = [[0.1,0.9],[0.0,1.0]]

tt = tf.convert_to_tensor(true)
tp = tf.convert_to_tensor(pred)

l = ls.SparseCategoricalCrossentropy(from_logits=True)
ret = l(tt,tp)

print(ret) #tf.Tensor(nan, shape=(), dtype=float32)

If I force TF to run on the CPU (see Can Keras with Tensorflow backend be forced to use CPU or GPU at will?), everything works fine. And yes, my UNet fits and predicts correctly on CPU.
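For reference, a minimal sketch of pinning the computation to the CPU along the lines of the linked answer (either approach works; the environment variable must be set before TensorFlow initializes):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # hide all GPUs; must run before TensorFlow starts

import tensorflow as tf

# Alternatively, pin the ops to the CPU with a device scope:
with tf.device('/CPU:0'):
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(
        tf.convert_to_tensor([0.0, 1.0]),
        tf.convert_to_tensor([[0.1, 0.9], [0.0, 1.0]]),
    )
print(loss)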

I checked several issues on the Keras GitHub, but they all point to problems with the compiled network, such as using an optimizer that is inappropriate for categorical crossentropy.

Is there a workaround? Am I missing something?


2 Answers


I had the same issue. My loss was a real number when I trained on CPU. I tried upgrading the TF version, but that didn't fix the problem. What finally fixed it was reducing the dimensionality of y: my model output was a 2D array, and once I reduced it to 1D I got a real loss on GPU as well.
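For illustration, a minimal sketch of that kind of fix, assuming the labels carried an extra trailing dimension (the shapes here are hypothetical, not taken from the answer):

import tensorflow as tf

y_true_2d = tf.constant([[0.0], [1.0]])         # 2-D labels, shape (2, 1)
y_pred = tf.constant([[0.1, 0.9], [0.0, 1.0]])  # per-class logits

# Squeeze the labels down to 1-D before computing the loss
y_true_1d = tf.squeeze(y_true_2d, axis=-1)      # shape (2,)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print(loss_fn(y_true_1d, y_pred))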

– Dani

The test code you have provided works fine on Google Colab.

tf.__version__

Output:

2.3

tf.config.list_physical_devices('GPU')  

Output:

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]  

Your code:

import tensorflow as tf
import tensorflow.keras.losses as ls

true = [0.0, 1.0]
pred = [[0.1,0.9],[0.0,1.0]]

tt = tf.convert_to_tensor(true)
tp = tf.convert_to_tensor(pred)

l = ls.SparseCategoricalCrossentropy(from_logits=True)
ret = l(tt,tp)

print(ret)  

Result:

tf.Tensor(0.8132616, shape=(), dtype=float32)
  • Then, I guess, it's fixed in newer versions of TF. I was using TF 2.1.0 back then. So technically it's not the answer to the question, but it's still a solution. – Alexandr Crit Oct 22 '20 at 16:49
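If upgrading TF is the fix taken, that is typically just:

pip install --upgrade tensorflow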