
I am working on the TensorFlow for Poets tutorial. Most of the time, training fails with the error "Nan in summary histogram". I run the following command on the original data to retrain:

    python -m scripts.retrain \
        --bottleneck_dir=tf_files/bottlenecks \
        --model_dir=tf_files/models/ \
        --summaries_dir=tf_files/training_summaries/"${ARCHITECTURE}" \
        --output_graph=tf_files/retrained_graph.pb \
        --output_labels=tf_files/retrained_labels.txt \
        --image_dir=/ml/data/images

This error has been reported elsewhere as well. I followed the instructions there using tfdbg, which gave me a bit more insight (see below). However, I am still stuck, because I do not know why this happens or what I can do to fix it, and I have little experience with TensorFlow and neural networks. It is especially confusing because it happens with 100% unmodified tutorial code and data.
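
For context, this is roughly how I wrapped the training session for tfdbg (a minimal sketch assuming TF 1.x; retrain.py does not contain these lines by default):

    import tensorflow as tf
    from tensorflow.python import debug as tf_debug

    # Wrap the session so tfdbg's CLI opens on each sess.run() call:
    sess = tf.Session()
    sess = tf_debug.LocalCLIDebugWrapperSession(sess)
    # Register the built-in filter that flags tensors containing inf or nan;
    # inside the tfdbg CLI, trigger it with: run -f has_inf_or_nan
    sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)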

Here is the output from tfdbg the first time the error appears:

[Screenshot: tfdbg output for the node with the error]

[Screenshot: the node in detail]

To look at the retrain script, you can find Google's original code here. It was not modified in my case. Sorry for not including it inline (too many characters).

Hyperparameters & result

For additional information: training does work with ridiculously small learning rates (e.g. 0.000001). However, this does not lead to good results: no matter how many epochs I train, performance stays at a low level (probably stuck in a local minimum during optimisation).
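
For reference, this is how I lowered the learning rate (retrain.py exposes the --learning_rate flag; the remaining flags are the same as in the command above):

    python -m scripts.retrain \
        --learning_rate=0.000001 \
        --bottleneck_dir=tf_files/bottlenecks \
        --model_dir=tf_files/models/ \
        --summaries_dir=tf_files/training_summaries/"${ARCHITECTURE}" \
        --output_graph=tf_files/retrained_graph.pb \
        --output_labels=tf_files/retrained_labels.txt \
        --image_dir=/ml/data/images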

Gegenwind
  • You should include the code – Maxim Feb 22 '18 at 13:46
  • Good point, I added the link @Maxim. It is the original Google code for this tutorial. – Gegenwind Feb 23 '18 at 07:19
  • @Gegenwind you need to include versions for TF, OS, CUDA (if used) – denfromufa Feb 28 '18 at 15:12
  • If you remove the summary histogram, does your training go through or not? Histograms have more difficulty dealing with outliers than training does. – Vincent Teyssier Mar 01 '18 at 10:13
  • It's actually kind of hard to get `tf.nn.softmax_cross_entropy_with_logits` to produce `inf`. I'm doing a few tests and only get good values or `nan` no matter what `logits` is, as long as `labels` is ok (doesn't have `inf` or similar, which I think is your case). Anyway, if each item belongs to one class, you can try [`tf.nn.sparse_softmax_cross_entropy_with_logits`](https://www.tensorflow.org/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits) and see if it's better (you'll have to reshape your data a bit though). – jdehesa Mar 01 '18 at 11:27
  • @jdehesa: thanks for your thoughts! I replaced the cross entropy function with alternatives, sparse_softmax as well (classes are exclusive anyway). I still usually get the error. Mostly it occurs in the softmax calculation, where the exp becomes too large. I know that TensorFlow implemented a way to prevent softmax from becoming inf (http://python.usyiyi.cn/documents/effective-tf/12.html), but for some reason it still happens. Are you aware of any typical data-related issues that might cause this? – Gegenwind Mar 05 '18 at 18:35
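
To illustrate the substitution discussed in the comments above, here is a minimal sketch (the logits and labels are made-up values; both calls compute the same loss when the sparse label index matches the one-hot position):

    import tensorflow as tf

    logits = tf.constant([[2.0, 1.0, 0.1]])

    # Dense one-hot labels, as used by the tutorial's retrain.py:
    onehot_labels = tf.constant([[1.0, 0.0, 0.0]])
    dense_loss = tf.nn.softmax_cross_entropy_with_logits(
        labels=onehot_labels, logits=logits)

    # Sparse integer class indices, as suggested by jdehesa; the labels
    # must be reshaped from one-hot vectors to plain class indices:
    sparse_labels = tf.constant([0])
    sparse_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=sparse_labels, logits=logits)

    with tf.Session() as sess:
        print(sess.run([dense_loss, sparse_loss]))  # two equal values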

2 Answers


Are you sure the tf_files folder is being created? I faced some issues running the script from the command line. I switched to Spyder and set the input variables directly in retrain.py as required, and it ran smoothly. I know it's not a solution, but it is a workaround.
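
As a quick sanity check (a sketch; the directory names match the flags in the question's command):

    import os

    # Directories the retrain command expects to be able to write into:
    for d in ("tf_files/bottlenecks",
              "tf_files/models",
              "tf_files/training_summaries"):
        if os.path.isdir(d):
            print("exists:", d)
        else:
            os.makedirs(d)
            print("created:", d)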


I had the same problem running on Python 2.7. When I researched compatibility, Python 3.5 was described as the best-supported version for the latest TensorFlow, so I created a virtual environment with Python 3.5. I suspect the Python version was the cause of the stability issue.
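
In case it helps, this is roughly the setup described above (a sketch; assumes virtualenv and Python 3.5 are installed, and the environment name tf-env is arbitrary):

    # Create and activate a Python 3.5 virtual environment:
    virtualenv -p python3.5 tf-env
    source tf-env/bin/activate
    # Install TensorFlow inside the environment:
    pip install tensorflow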