I ran into the same problem that pythonic metaphor posted here: Why does TensorFlow example fail when increasing batch size? I have read through that post and its great answers, but I still have some further questions.
Let me describe the problem again:
I was looking at http://www.tensorflow.org/tutorials/mnist/beginners/index.md and found this part:
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
Changing the batch size from 100 to anything equal to or greater than 209 makes the model fail to converge.
cross_entropy is used as the loss function here:
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
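For context, the surrounding setup that these snippets assume looks roughly like this. This is my own sketch reproduced from memory of the old TF 0.x beginners tutorial, so treat the details as approximate:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

x  = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 images
W  = tf.Variable(tf.zeros([784, 10]))
b  = tf.Variable(tf.zeros([10]))
y  = tf.nn.softmax(tf.matmul(x, W) + b)        # predicted class probabilities
y_ = tf.placeholder(tf.float32, [None, 10])    # one-hot true labels

cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

sess = tf.Session()
sess.run(tf.initialize_all_variables())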
In the previously mentioned post, I found some great answers proposing 3 fixes:
1) dga mentioned decreasing the learning rate from 0.01 to 0.005.
2) colah mentioned changing reduce_sum to reduce_mean when calculating cross_entropy.
3) Muaaz mentioned changing log(y) to log(y + 1e-10) when calculating cross_entropy.
(Each fix is sketched in code right after this list.)
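Here is how I understand each fix in code, assuming the rest of the tutorial graph (x, y, y_, and the optimizer line) stays as above; this is just my own sketch of the answers, not code copied from that post:

# 1) dga: lower the learning rate from 0.01 to 0.005
train_step = tf.train.GradientDescentOptimizer(0.005).minimize(cross_entropy)

# 2) colah: average the loss over the batch instead of summing it
cross_entropy = -tf.reduce_mean(y_ * tf.log(y))

# 3) Muaaz: keep the argument of the log strictly positive
cross_entropy = -tf.reduce_sum(y_ * tf.log(y + 1e-10))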
They all work! The model converges after applying any of the 3 fixes.
But my further question is:
According to the 3rd fix, the failure happens because log(0) occurs. I actually verified this by printing out y's minimum value during training: once log(0) occurs, cross_entropy becomes NaN.
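To make that concrete, here is a tiny NumPy illustration (a made-up 2-class example, not the actual MNIST run) of how a single zero probability turns the whole loss into NaN, and how the 3rd fix avoids it:

import numpy as np

# One sample whose softmax output has saturated: probability 1.0 for
# the true class and exactly 0.0 for the other class.
y  = np.array([[0.0, 1.0]])   # predicted probabilities
y_ = np.array([[0.0, 1.0]])   # one-hot true label

print(np.log(y))                         # [[-inf  0.]]
print(y_ * np.log(y))                    # [[nan  0.]]  because 0 * -inf = nan
print(-np.sum(y_ * np.log(y)))           # nan -> the whole batch loss is NaN

# Fix 3 keeps the log argument away from zero, so the loss stays finite:
print(-np.sum(y_ * np.log(y + 1e-10)))   # ~0.0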
But how can the 1st and 2nd fixes be explained? Why do they work?
y is a matrix containing the probabilities of each input case being digit '0'~'9', so it is expected that more and more elements of y become 0 as training goes on. Neither fix 1 nor fix 2 prevents that.
What is the magic behind them?