
I have hacked together code from the Udacity Deep Learning course (Assignment 3 - Regularization) and the TensorFlow mnist_with_summaries.py tutorial. My code appears to run fine

https://github.com/llevar/udacity_deep_learning/blob/master/multi-layer-net.py

but something strange is going on. The assignments all use a learning rate of 0.5 and at some point introduce exponential decay. However, the code I put together runs fine only when I set the learning rate to 0.001 (with or without decay). If I set the initial rate to 0.1 or greater, I get the following error:

Traceback (most recent call last):
  File "/Users/siakhnin/Documents/workspace/udacity_deep_learning/multi-layer-net.py", line 175, in <module>
    summary, my_accuracy, _ = my_session.run([merged, accuracy, train_step], feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 340, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 564, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 637, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 659, in _do_call
    e.code)
tensorflow.python.framework.errors.InvalidArgumentError: Nan in summary histogram for: layer1/weights/summaries/HistogramSummary
     [[Node: layer1/weights/summaries/HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](layer1/weights/summaries/HistogramSummary/tag, layer1/weights/Variable/read)]]
Caused by op u'layer1/weights/summaries/HistogramSummary', defined at:
  File "/Users/siakhnin/Documents/workspace/udacity_deep_learning/multi-layer-net.py", line 106, in <module>
    layer1, weights_1 = nn_layer(x, num_features, 1024, 'layer1')
  File "/Users/siakhnin/Documents/workspace/udacity_deep_learning/multi-layer-net.py", line 79, in nn_layer
    variable_summaries(weights, layer_name + '/weights')
  File "/Users/siakhnin/Documents/workspace/udacity_deep_learning/multi-layer-net.py", line 65, in variable_summaries
    tf.histogram_summary(name, var)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/logging_ops.py", line 113, in histogram_summary
    tag=tag, values=values, name=scope)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 55, in _histogram_summary
    name=name)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2154, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1154, in __init__
    self._traceback = _extract_stack()

If I set the rate to 0.001, the code runs to completion with a test accuracy of 0.94.
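For reference, the learning-rate setup I'm describing looks roughly like this (a sketch only; the decay constants are illustrative and the loss name cross_entropy stands in for whatever the real loss node is, the actual code is in the linked file):

# Sketch only: constants and the loss name (cross_entropy) are illustrative.
global_step = tf.Variable(0, trainable=False)  # incremented once per training step
learning_rate = tf.train.exponential_decay(0.5, global_step, 1000, 0.96)
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    cross_entropy, global_step=global_step)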

Using TensorFlow 0.8 RC0 on Mac OS X.

llevar

2 Answers

5

Looks like your training is diverging (which causes you to get infinities or NaNs). There's no simple explanation for why things diverge under one set of conditions but not another, but generally a higher learning rate makes divergence more likely.

Edit, Apr 17: You are getting a NaN in your histogram summary, which most likely means there's a NaN in your weights or activations. NaNs are caused by numerically invalid calculations, e.g., taking the log of 0 and multiplying the result by 0. There's also a small chance there's a bug in the histogram summary itself; to rule that out, turn off summaries and see if you are still able to train to good accuracy.

To turn off summaries, replace this line:

merged = tf.merge_all_summaries()

with this:

merged = tf.constant(1)

and comment out this line:

test_writer.add_summary(summary)
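If you want to check the weights directly rather than relying on the histogram op, a simple check (a sketch only, assuming the weights_1, my_session, train_step and feed_dict names from your script) is to fetch them in the same run call and test for NaN/Inf on the Python side:

import numpy as np

# Fetch the first layer's weights together with the training step
# and verify they are still finite after the update.
w1, _ = my_session.run([weights_1, train_step], feed_dict=feed_dict)
if not np.all(np.isfinite(w1)):
    print("Non-finite values in layer1 weights")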
Yaroslav Bulatov
  • Hi Yaroslav, thanks for your reply. Can you help me unpack the error message a bit? How do I debug this? It seems to occur in the second epoch, which seems a bit too fast to diverge. Is it the histogram that's the problem, or the weights? Is it possible to step through TensorFlow execution with a conventional debugger? I seem to be able to use a high learning rate of 0.5 with the out-of-the-box examples in the Udacity and TensorFlow tutorials, so I fear it may be some subtle bug in my code that causes things to behave this way. Thanks for your help. – llevar Apr 17 '16 at 18:57
  • When I comment out the summaries and bump up the learning rate to 0.1, the program no longer crashes, but it does not learn either; accuracy stays at 10% over several thousand epochs. Since the training data is constant and the starting weights are sampled from a normal(0, 0.1), I expect runs of my code to behave like the Udacity code, yet theirs runs fine even with a rate of 0.5. Is the right way to debug (to look for an unexpectedly large error gradient, for example) to collect a return value from the session.run method, or are there more convenient ways to interrogate the system state? – llevar Apr 18 '16 at 19:58
  • Does the training loss go down? If it doesn't go down, then you have a problem with optimization (i.e., optimization is getting stuck). If you are getting NaNs in the training loss, you are getting divergence. It's also possible it's going down, but very slowly, so you need a million epochs to see a difference. Sometimes, with ReLUs, the training explodes, but instead of NaNs you get zeros in your activations. One way to interrogate system state is to compute nodes that do stats (average activations plus gradient magnitude) and then look at those values (see the sketch after these comments). – Yaroslav Bulatov Apr 18 '16 at 20:32
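A minimal sketch of the stats nodes suggested in the last comment (the loss name cross_entropy is assumed; layer1, weights_1, my_session, train_step and feed_dict are the names from the question's script):

# Average activation of the first layer and gradient magnitude of its weights.
mean_act = tf.reduce_mean(layer1)
grad_w1 = tf.gradients(cross_entropy, [weights_1])[0]
grad_mag = tf.sqrt(tf.reduce_sum(tf.square(grad_w1)))

# Fetch the stats alongside the training step and inspect them every step.
act, gmag, _ = my_session.run([mean_act, grad_mag, train_step], feed_dict=feed_dict)
print("mean activation %g, gradient magnitude %g" % (act, gmag))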
0

Your cross-entropy:

diff = y_ * tf.log(y)

also needs to handle the case 0 * log(0), which produces NaN.

You can change it to:

cross_entropy = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))

source: Tensorflow NaN bug?
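A related option (not from the linked answer, just a common alternative): if your graph keeps the pre-softmax output around, the fused op computes the cross-entropy in a numerically stable way and avoids 0 * log(0) entirely. The name logits below is assumed:

# logits: the last layer's output *before* softmax (name assumed here)
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits, y_))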

DMTishler