I am working through the TensorFlow for Poets tutorial. Most of the time, training fails with the error "Nan in summary histogram".
I run the following command on the original data to retrain:
python -m scripts.retrain \
  --bottleneck_dir=tf_files/bottlenecks \
  --model_dir=tf_files/models/ \
  --summaries_dir=tf_files/training_summaries/"${ARCHITECTURE}" \
  --output_graph=tf_files/retrained_graph.pb \
  --output_labels=tf_files/retrained_labels.txt \
  --image_dir=/ml/data/images
This error has been reported elsewhere as well. I followed the instructions given there and used tfdbg, which gave me a bit more insight (see below). However, I am still stuck: I do not know why this happens or how to fix it, since I have little experience with TF and neural networks. It is especially confusing because it happens with 100% unmodified tutorial code and data.
Here is the output from tfdbg at the point where the error first appears:
The retrain script is Google's original code, which you can find here. I did not modify it. Sorry for not including it (too many characters).
Hyperparameters & result
For additional information: training works with ridiculously small values for the learning rate (e.g. 0.000001). However, this does not lead to good results. No matter how many epochs I train, performance stays at a low level (probably because the optimisation gets stuck in a local minimum).
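
To make the learning-rate observation concrete, here is a toy sketch in plain Python. It is not the retrain script and does not use TensorFlow; the function name gradient_descent and the loss f(w) = w^4 are my own assumptions, chosen only because they reproduce the same pattern: a large step size overflows to infinity and then to NaN, while a tiny step size stays finite but barely moves.

```python
import math


def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimise the toy loss f(w) = w**4 with plain gradient descent.

    Hypothetical helper for illustration only; returns the final w.
    """
    w = start
    for _ in range(steps):
        # Gradient of w**4 is 4 * w**3. Written as repeated multiplication
        # so that overflow yields inf (as in float tensors) instead of
        # raising Python's OverflowError.
        grad = 4.0 * w * w * w
        w = w - learning_rate * grad
    return w


if __name__ == "__main__":
    for lr in (1.0, 0.1, 0.000001):
        final_w = gradient_descent(lr)
        status = "NaN" if math.isnan(final_w) else "finite"
        print(f"learning_rate={lr}: final w = {final_w} ({status})")
```

In this sketch the largest rate diverges (inf minus inf gives NaN after a few steps), the middle rate moves steadily towards the minimum at 0, and the smallest rate leaves w almost at its starting value. Whether the same mechanism is behind the NaN in the tutorial code is something I have not confirmed.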