9

Problem

I'm running a deep neural network on MNIST, where the loss is defined as follows:

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, label))

The program seems to run correctly until I get a NaN loss somewhere after the 10,000th minibatch. Sometimes the program runs correctly until it finishes. I think tf.nn.softmax_cross_entropy_with_logits is giving me this error, which is strange, because the rest of the code only contains multiply and add operations.

Possible Solution

Maybe I can use:

if cost == "nan":
  optimizer = an empty optimizer 
else:
  ...
  optimizer = real optimizer

But I cannot find the type of NaN to compare against. How can I check whether a variable is NaN or not?

How else can I solve this problem?

Seanny123
Swind D.C. Xu
  • Check the implementation of `tf.add_check_numerics_ops`; it adds `Assert` ops to every tensor to make sure there are no NaNs, so you can use whatever it uses to check for NaN-ness – Yaroslav Bulatov Oct 20 '16 at 20:37
  • I am new to TensorFlow; when I use `tf.add_check_numerics_ops`, it gives me the error "tensorflow.python.framework.errors.InvalidArgumentError: All inputs to node model/CheckNumerics_254 must be from the same frame." Did I use it the wrong way? – Swind D.C. Xu Oct 21 '16 at 02:54
  • I just meant that you can look at the implementation of `add_check_numerics_ops` to see which op determines whether a variable is NaN, and use that op – Yaroslav Bulatov Oct 21 '16 at 19:04
  • Possible duplicate of [Tensorflow Nan loss reasons](https://stackoverflow.com/questions/40050397/tensorflow-nan-loss-reasons) – Seanny123 Jun 01 '17 at 04:54

4 Answers

9

I found a similar problem here: TensorFlow cross_entropy NaN problem

Thanks to the author user1111929:

tf.nn.softmax_cross_entropy_with_logits, which corresponds to

-tf.reduce_sum(y_*tf.log(y_conv))

is actually a horrible way of computing the cross-entropy. In some samples, certain classes can be excluded with certainty after a while, resulting in y_conv=0 for that sample. That's normally not a problem, since you're not interested in those, but with cross_entropy written that way it yields 0*log(0) for that particular sample/class. Hence the NaN.

Replacing it with

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10))

Or

cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))

solved the NaN problem.
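
If your softmax output and one-hot labels are named like the y_conv and y_ above (an assumption carried over from the linked question, not the asker's actual variable names), the clipped version drops into the original cost roughly like this sketch:

# y_conv: softmax output of the network, y_: one-hot labels (assumed names)
cross_entropy = -tf.reduce_sum(
    y_ * tf.log(tf.clip_by_value(y_conv, 1e-10, 1.0)),
    reduction_indices=[1])              # per-example cross-entropy
cost = tf.reduce_mean(cross_entropy)    # mean over the minibatch, as in the question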

ForrestZhang
8

The reason you are getting NaNs is most likely that somewhere in your cost function or softmax you are taking the log of zero, which is not a number. To answer your specific question about detecting NaN: Python has built-in support for testing for NaN in the math module. For example:

import math

val = float('nan')
if math.isnan(val):
    print('Detected NaN')
    import pdb; pdb.set_trace()  # break into the debugger to look around
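
To apply that check to a TensorFlow loss you first have to fetch the value out of the graph, since sess.run returns ordinary NumPy floats. A minimal self-contained sketch (not the asker's model, and assuming the pre-1.0 TensorFlow API used elsewhere in this thread) that deliberately reproduces the 0*log(0) case and then detects it:

import math
import tensorflow as tf

# A softmax output containing an exact zero for a zero-labelled class
# reproduces the 0*log(0) = NaN situation described in the first answer.
y_     = tf.constant([[0.0, 1.0]])   # one-hot label
y_conv = tf.constant([[0.0, 1.0]])   # degenerate "softmax" output with a hard zero
loss   = -tf.reduce_sum(y_ * tf.log(y_conv))

with tf.Session() as sess:
    loss_val = sess.run(loss)        # an ordinary NumPy float
    if math.isnan(loss_val):
        print('Detected NaN loss')
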
Greg K
7

Check your learning rate. The bigger your network, the more parameters there are to learn, which means you also need to decrease the learning rate.
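
For instance, with the plain gradient-descent optimizer the change is just the constant you pass in. The value below is only an illustrative starting point, not a tuned recommendation, and the dummy cost stands in for the loss from the question so the snippet runs on its own:

import tensorflow as tf

cost = tf.Variable(1.0)  # placeholder standing in for the question's loss
learning_rate = 1e-4     # smaller than typical MNIST-tutorial defaults such as 0.5 or 0.01
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)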

Ilyakom
2

I don't have your code or data, but tf.nn.softmax_cross_entropy_with_logits should be stable with a valid probability distribution (more info here). I assume your data does not meet this requirement; an analogous problem was also discussed here. This would lead you to either:

  1. Implement your own softmax_cross_entropy_with_logits function, e.g. try (source):

    # `shape`, `logits`, and `labels` come from the surrounding code in the linked source
    epsilon = tf.constant(value=0.00001, shape=shape)
    logits = logits + epsilon
    softmax = tf.nn.softmax(logits)
    cross_entropy = -tf.reduce_sum(labels * tf.log(softmax), reduction_indices=[1])  # per-example cross-entropy
    
  2. Update your data so that it does have a valid probability distribution (a quick way to check this is sketched below)
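
For point 2, here is a hedged sketch of a NumPy-side sanity check you could run on a batch of labels before feeding it in (the function name and label_batch argument are placeholders, not part of the answer above):

import numpy as np

def is_valid_distribution(label_batch, tol=1e-6):
    # Each row must be non-negative and sum to 1 to be a valid probability distribution.
    label_batch = np.asarray(label_batch, dtype=np.float64)
    non_negative = np.all(label_batch >= 0.0)
    sums_to_one = np.allclose(label_batch.sum(axis=1), 1.0, atol=tol)
    return non_negative and sums_to_one

# One-hot MNIST-style labels pass the check; an inconsistent row does not.
print(is_valid_distribution(np.array([[0.0, 1.0], [1.0, 0.0]])))  # True
print(is_valid_distribution(np.array([[0.5, 0.6]])))              # False (sums to 1.1)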

Fematich
  • I use the standard MNIST dataset; I think its probability distribution is valid. – Swind D.C. Xu Oct 20 '16 at 16:00
  • Why is the epsilon added to the logits rather than to the softmax? – Swind D.C. Xu Oct 20 '16 at 16:02
  • `epsilon` is added to the logits so that the sum of the resulting softmax is still 1, but cannot contain zeros either (these result in NaN). It's very strange that you have this problem with the standard MNIST dataset... Could you check what happens if you use this new `cross_entropy` function? If that doesn't work, you probably have to look at the actual logits. – Fematich Oct 20 '16 at 16:08
  • Hi, I just found a similar question on SO [here](http://stackoverflow.com/questions/33712178/tensorflow-nan-bug/33713196#33713196) in which case the cross_entropy was adjusted with `clipping`. Although here he started with a very simple implementation of cross_entropy, instead of `tf.nn.softmax_cross_entropy_with_logits`. BTW, did you get it to work now? – Fematich Oct 21 '16 at 10:19