
I am working on the TensorFlow for Poets tutorial. Most of the time, training fails with the error "Nan in summary histogram". I run the following command on the original data to retrain:

    python -m scripts.retrain \
        --bottleneck_dir=tf_files/bottlenecks \
        --model_dir=tf_files/models/ \
        --summaries_dir=tf_files/training_summaries/"${ARCHITECTURE}" \
        --output_graph=tf_files/retrained_graph.pb \
        --output_labels=tf_files/retrained_labels.txt \
        --image_dir=/ml/data/images

This error has been reported elsewhere as well. I followed the instructions there using tfdbg, which gave me a bit more insight (see below). However, I am still stuck, because I do not know why this happens or what I can do to fix it, and I have little experience with TensorFlow and neural networks. It is especially confusing because it happens with 100% unmodified tutorial code and data.
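
For context, this is roughly how I wrapped the training session for tfdbg (a minimal sketch assuming TF 1.x; retrain.py does not contain these lines by default):

    import tensorflow as tf
    from tensorflow.python import debug as tf_debug

    # Wrap the session so tfdbg's CLI opens on each sess.run() call:
    sess = tf.Session()
    sess = tf_debug.LocalCLIDebugWrapperSession(sess)
    # Register the built-in filter that flags tensors containing inf or nan;
    # inside the tfdbg CLI, trigger it with: run -f has_inf_or_nan
    sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)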

Here is the output from tfdbg the first time the error appears:

[Screenshot: tfdbg output for the node with the error]

[Screenshot: the node in detail]

To look at the retrain script, you can find Google's original code here. It was not modified in my case. Sorry for not including it inline (too many characters).

Hyperparameters & result

For additional information: training does work with ridiculously small learning rates (e.g. 0.000001). However, this does not lead to good results: no matter how many epochs I train, performance stays at a low level (probably stuck in a local minimum during optimisation).
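
For reference, this is how I lowered the learning rate (retrain.py exposes the --learning_rate flag; the remaining flags are the same as in the command above):

    python -m scripts.retrain \
        --learning_rate=0.000001 \
        --bottleneck_dir=tf_files/bottlenecks \
        --model_dir=tf_files/models/ \
        --summaries_dir=tf_files/training_summaries/"${ARCHITECTURE}" \
        --output_graph=tf_files/retrained_graph.pb \
        --output_labels=tf_files/retrained_labels.txt \
        --image_dir=/ml/data/images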

Gegenwind
  • You should include the code – Maxim Feb 22 '18 at 13:46
  • Good point, I added the link @Maxim. It is the original Google code for this tutorial. – Gegenwind Feb 23 '18 at 07:19
  • @Gegenwind you need to include versions for TF, OS, CUDA (if used) – denfromufa Feb 28 '18 at 15:12
  • If you remove the summary histogram, does your training go through or not? Histograms have more difficulty dealing with outliers than training does. – Vincent Teyssier Mar 01 '18 at 10:13
  • It's actually kind of hard to get `tf.nn.softmax_cross_entropy_with_logits` to produce `inf`. I'm doing a few tests and only get good values or `nan` no matter what `logits` is, as long as `labels` is ok (doesn't have `inf` or similar, which I think is your case). Anyway, if each item belongs to one class, you can try [`tf.nn.sparse_softmax_cross_entropy_with_logits`](https://www.tensorflow.org/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits) and see if it's better (you'll have to reshape your data a bit though). – jdehesa Mar 01 '18 at 11:27
  • @jdehesa: thanks for your thoughts! I replaced the cross entropy function with alternatives, sparse_softmax as well (classes are exclusive anyway). I still usually get the error. Mostly it occurs in the softmax calculation, where the exp becomes too large. I know that TensorFlow implemented a way to prevent softmax from becoming inf (http://python.usyiyi.cn/documents/effective-tf/12.html), but for some reason it still happens. Are you aware of any typical data-related issues that might cause this? – Gegenwind Mar 05 '18 at 18:35
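
To illustrate the substitution discussed in the comments above, here is a minimal sketch (the logits and labels are made-up values; both calls compute the same loss when the sparse label index matches the one-hot position):

    import tensorflow as tf

    logits = tf.constant([[2.0, 1.0, 0.1]])

    # Dense one-hot labels, as used by the tutorial's retrain.py:
    onehot_labels = tf.constant([[1.0, 0.0, 0.0]])
    dense_loss = tf.nn.softmax_cross_entropy_with_logits(
        labels=onehot_labels, logits=logits)

    # Sparse integer class indices, as suggested by jdehesa; the labels
    # must be reshaped from one-hot vectors to plain class indices:
    sparse_labels = tf.constant([0])
    sparse_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=sparse_labels, logits=logits)

    with tf.Session() as sess:
        print(sess.run([dense_loss, sparse_loss]))  # two equal values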

2 Answers


Are you sure the tf_files folder is being created? I faced some issues running the script from the command line. I switched to Spyder and set the input variables directly in retrain.py as required, and it ran smoothly. I know it's not a solution, but it is a workaround.
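
As a quick sanity check (a sketch; the directory names match the flags in the question's command):

    import os

    # Directories the retrain command expects to be able to write into:
    for d in ("tf_files/bottlenecks",
              "tf_files/models",
              "tf_files/training_summaries"):
        if os.path.isdir(d):
            print("exists:", d)
        else:
            os.makedirs(d)
            print("created:", d)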


I had the same problem running on Python 2.7. When I researched compatibility, Python 3.5 was described as the best-supported version for the latest TensorFlow, so I created a virtual environment with Python 3.5. I suspect the Python version was the cause of the stability issue.
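
In case it helps, this is roughly the setup described above (a sketch; assumes virtualenv and Python 3.5 are installed, and the environment name tf-env is arbitrary):

    # Create and activate a Python 3.5 virtual environment:
    virtualenv -p python3.5 tf-env
    source tf-env/bin/activate
    # Install TensorFlow inside the environment:
    pip install tensorflow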