
I'm trying out TensorFlow and I'm running into a strange error. I edited the deep MNIST example to use another set of images, and the algorithm again converges nicely until around iteration 8000 (91% accuracy at that point), when it crashes with the following error.

tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input is not finite

At first I thought some coefficients might be reaching the limit for a float, but adding L2 regularization on all weights and biases didn't resolve the issue. The stack trace always points to the first relu application:

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

I'm working on CPU only for now. Any idea what could cause this and how to work around it?

Edit: I traced it down to this question: Tensorflow NaN bug? The solution there works.
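For context, the fix discussed there amounts to keeping the argument of the log strictly positive. A sketch of that kind of line, using the tutorial's y_ / y_conv names and not necessarily the exact code from the linked question:

# Sketch: clamp the softmax output away from 0 so log() never sees 0
# and 0 * log(0) can never produce a NaN.
cross_entropy = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y_conv, 1e-10, 1.0)))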

user1111929
  • I also noticed that if, in the line `train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)`, I change the value to 1e-3, the crash occurs significantly earlier. However, changing it to 1e-5 prevents the algorithm from converging. – user1111929 Nov 13 '15 at 19:46
  • For Adam, you might want to increase the `epsilon` argument. The current default is `epsilon=1e-8`. Look at the documentation. It says "For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1." Also see [this](https://github.com/tensorflow/tensorflow/issues/323#issuecomment-159116515) discussion. – Albert Jul 14 '17 at 11:36
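For illustration, the change Albert describes is just a constructor argument on the optimizer; a minimal sketch, where 0.1 is only an example value rather than a tuned recommendation:

# Sketch: same optimizer as in the question, with a larger epsilon.
train_step = tf.train.AdamOptimizer(1e-4, epsilon=0.1).minimize(cross_entropy)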

4 Answers


The error is due to 0 * log(0).

This can be avoided by:

cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv + 1e-9))
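To see why that term blows up, a quick check (NumPy is used here purely for the demonstration):

import numpy as np

# np.log(0.0) is -inf (NumPy warns about the divide by zero),
# and 0.0 * -inf is nan, which then poisons the gradients.
print(0.0 * np.log(0.0))   # prints: nan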
Muaaz

Since I had another topic on this issue [ Tensorflow NaN bug? ], I didn't keep this one updated, but the solution has been there for a while and has since been echoed by posters here. The problem is indeed 0*log(0) resulting in a NaN.

One option is to use the line Muaaz suggests here, or the one I wrote in the linked topic. But TensorFlow has a built-in routine for exactly this, tf.nn.softmax_cross_entropy_with_logits, which is more efficient and more numerically stable. As a commenter on the linked topic pointed out, it should be preferred over the manual fixes Muaaz and I suggested wherever possible.
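For reference, a minimal sketch of what that switch might look like, assuming the deep MNIST tutorial's variables; note that the op expects the unscaled logits, i.e. the last layer's output before tf.nn.softmax (introduced here under the name `logits` just for the example):

# Sketch, assuming the deep MNIST tutorial's layer variables (W_fc2, b_fc2,
# h_fc1_drop, y_). The op wants logits, i.e. the last layer BEFORE the softmax.
logits = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)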

user1111929

I have run into this input is not finite error before (though not with tf.nn.relu). In my case the problem was that the elements of my tensor variable grew very large, eventually overflowing to infinity, hence the message input is not finite.

I would suggest adding some debugging output around tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1) every n-th iteration to track exactly when it reaches infinity.
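For example, a rough sketch of such a check; variable names follow the deep MNIST tutorial (which uses an InteractiveSession, so .eval() and .run() work directly), and the interval of 100 steps is arbitrary:

import numpy as np

# Evaluate the tensor feeding the first relu every 100 steps and report
# whether it is still finite and how large it has grown.
pre_act = conv2d(x_image, W_conv1) + b_conv1
for i in range(20000):
    batch = mnist.train.next_batch(50)
    if i % 100 == 0:
        vals = pre_act.eval(feed_dict={x: batch[0]})
        print(i, "max |pre-activation|:", np.abs(vals).max(),
              "finite:", np.isfinite(vals).all())
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})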

This looks consistent with your comment:

I modify the value to 1e-3, the crash occurs significantly earlier. However, changing it to 1e-5 prevents the algorithm from converging

Salvador Dali
  • Could you elaborate why this is consistent with my comment? I've added a clipping: relu(clip(...)) and now I get NaN values in my output instead of infinity, so I feel that this is not the root of the problem. Could it be that once a solution gets good, the optimization algorithm can't do anything anymore, and crashes (or does 0/0) instead of just stopping? Not sure how to continue if that's the case. – user1111929 Nov 14 '15 at 13:48
  • As for adding debugging info, I did and some values effectively keep growing. Logical: convolution makes them increase by a large factor. The growth is so large that regularization doesn't help enough (or at least makes the algorithm inefficient before actually solving this issue). Clipping apparently doesn't help either. Neither did replacing relu by softplus (then the algorithm doesn't converge to a good classifier anymore). Any other ideas what I could try? – user1111929 Nov 14 '15 at 13:50
  • @user1111929 if they reached an infinity, then this question is solved. Ask another question of how to deal with `relu` to prevent it from reaching infinity. – Salvador Dali Nov 14 '15 at 21:34
  • Another possible explanation might be that your input data is outside of the [0,1] range that is used in the examples. Try to rescale your input data and see how that changes the outcome as well. – Daniel Zakrisson Nov 17 '15 at 12:54

Can't comment because of reputation, but Muaaz has the answer. The error can be reproduced by training a system that reaches 0 error, resulting in log(0). His solution prevents this. Alternatively, catch the error and move on.

# ...your other code...
try:
    for i in range(10000):
        train_accuracy = accuracy.eval(feed_dict={
            x: batch_xs, y_: batch_ys, keep_prob: 1.0})
except:
    print("training interrupted. Hopefully deliberately")
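A hedged variant, if you prefer to catch only the error class from the traceback in the question rather than every exception (the import path below is taken from that error message):

# Catch only InvalidArgumentError, whose module path appears in the
# question's traceback, instead of swallowing every exception.
from tensorflow.python.framework import errors

try:
    for i in range(10000):
        train_accuracy = accuracy.eval(feed_dict={
            x: batch_xs, y_: batch_ys, keep_prob: 1.0})
except errors.InvalidArgumentError:
    print("training interrupted. Hopefully deliberately")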
The Puternerd