I ran into the same problem that pythonic metaphor posted here: Why does TensorFlow example fail when increasing batch size? I have read through that post and its great answers, but I still have some further questions.
Let me describe the problem again:
I was looking at http://www.tensorflow.org/tutorials/mnist/beginners/index.md and found this part:
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
Changing the batch size from 100 to anything equal to or greater than 209 makes the model fail to converge.
cross_entropy is used as the loss function here:
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
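For context, the surrounding setup that these snippets assume looks roughly like this. This is my own sketch reproduced from memory of the old TF 0.x beginners tutorial, so treat the details as approximate:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

x  = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 images
W  = tf.Variable(tf.zeros([784, 10]))
b  = tf.Variable(tf.zeros([10]))
y  = tf.nn.softmax(tf.matmul(x, W) + b)        # predicted class probabilities
y_ = tf.placeholder(tf.float32, [None, 10])    # one-hot true labels

cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

sess = tf.Session()
sess.run(tf.initialize_all_variables())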
In the previously mentioned post, I found some great answers proposing 3 fixes:
1) dga mentioned decreasing the learning rate from 0.01 to 0.005.
2) colah mentioned changing reduce_sum to reduce_mean when calculating cross_entropy.
3) Muaaz mentioned changing log(y) to log(y + 1e-10) when calculating cross_entropy.
(Each fix is sketched in code right after this list.)
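Here is how I understand each fix in code, assuming the rest of the tutorial graph (x, y, y_, and the optimizer line) stays as above; this is just my own sketch of the answers, not code copied from that post:

# 1) dga: lower the learning rate from 0.01 to 0.005
train_step = tf.train.GradientDescentOptimizer(0.005).minimize(cross_entropy)

# 2) colah: average the loss over the batch instead of summing it
cross_entropy = -tf.reduce_mean(y_ * tf.log(y))

# 3) Muaaz: keep the argument of the log strictly positive
cross_entropy = -tf.reduce_sum(y_ * tf.log(y + 1e-10))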
They all work! The model converges after applying any of the 3 fixes.
But my further question is:
According to the 3rd fix, the failure happens because log(0) occurs. I actually verified this by printing out y's minimum value during training: once log(0) occurs, cross_entropy becomes NaN.
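To make that concrete, here is a tiny NumPy illustration (a made-up 2-class example, not the actual MNIST run) of how a single zero probability turns the whole loss into NaN, and how the 3rd fix avoids it:

import numpy as np

# One sample whose softmax output has saturated: probability 1.0 for
# the true class and exactly 0.0 for the other class.
y  = np.array([[0.0, 1.0]])   # predicted probabilities
y_ = np.array([[0.0, 1.0]])   # one-hot true label

print(np.log(y))                         # [[-inf  0.]]
print(y_ * np.log(y))                    # [[nan  0.]]  because 0 * -inf = nan
print(-np.sum(y_ * np.log(y)))           # nan -> the whole batch loss is NaN

# Fix 3 keeps the log argument away from zero, so the loss stays finite:
print(-np.sum(y_ * np.log(y + 1e-10)))   # ~0.0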
But how can the 1st and 2nd fixes be explained? Why do they work?
y is a matrix containing the probabilities of each input case being digit '0'~'9', so it is expected that more and more elements of y become 0 as training goes on. Neither fix 1 nor fix 2 prevents that.
What is the magic behind them?