
I am attempting to replicate a deep convolutional neural network from a research paper. I have implemented the architecture, but after 10 epochs my cross-entropy loss suddenly increases to infinity. This can be seen in the chart below. You can ignore what happens to the accuracy after the problem occurs.

Here is the GitHub repository with a picture of the architecture.

After doing some research, I think the AdamOptimizer or ReLU might be the problem.

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 7168])
y_ = tf.placeholder(tf.float32, shape=[None, 7168, 3])

# Many convolutions and ReLUs omitted

final = tf.reshape(final, [-1, 7168])
keep_prob = tf.placeholder(tf.float32)

# weight_variable and bias_variable are helper functions defined elsewhere in the repo
W_final = weight_variable([7168, 7168, 3])
b_final = bias_variable([7168, 3])
final_conv = tf.tensordot(final, W_final, axes=[[1], [1]]) + b_final

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=final_conv))
train_step = tf.train.AdamOptimizer(1e-5).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(final_conv, 2), tf.argmax(y_, 2))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

EDIT: In case anyone is interested, the solution was that I was feeding in incorrect data.
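For anyone hitting the same thing: a minimal sanity check on every feed batch would have caught it. This is only a sketch; batch_x and batch_y are placeholder names for whatever NumPy arrays get fed into x and y_:

import numpy as np

def check_batch(batch_x, batch_y):
    # inputs should be finite (no NaN or inf)
    assert np.all(np.isfinite(batch_x)), "NaN or inf in the input batch"
    # labels should be one-hot over the 3 classes for each of the 7168 positions
    assert batch_y.shape[1:] == (7168, 3), "unexpected label shape"
    assert np.all(batch_y.sum(axis=2) == 1), "labels are not one-hot"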

Devin Haslam
  • After the incident, loss is much lower and accuracy is much higher? Could you reproduce the problem with a different setting for randomly shuffling the dataset after each epoch? I doubt it's an accidental adversarial case. – THN Feb 04 '18 at 23:39
  • the question says to ignore the accuracy after the problem occurs – Jai Feb 04 '18 at 23:42
  • @Jai Yeah, but why ignore it? It's more intriguing. – THN Feb 05 '18 at 00:12
  • Yeah it is... I assume that it's not the right graph... – Jai Feb 05 '18 at 01:19
  • The loss goes to 0 because the graph cannot show a value for NaN (infinity). The accuracy increases because, after the problem occurs, the model labels every category "0." It just happens that labeling everything "0" is pretty accurate. – Devin Haslam Feb 05 '18 at 01:58

3 Answers


Solution: Control the solution space. This might mean using smaller datasets when training, it might mean using fewer hidden nodes, it might mean initializing your W and b differently. Your model is reaching a point where the loss is undefined, which might be due to the gradient being undefined, or to the final_conv signal itself.
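As a sketch of the initialization point: weight_variable is a helper from the question's repo whose definition isn't shown, so assuming it wraps something like tf.truncated_normal, shrinking the stddev keeps the initial logits (and therefore the first losses and gradients) small:

def weight_variable(shape, stddev=0.01):
    # a smaller stddev than the commonly used 0.1 keeps the initial solution space tight
    return tf.Variable(tf.truncated_normal(shape, stddev=stddev))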

Why: Sometimes, no matter what, numerical instability is reached. Eventually adding a machine epsilon to prevent dividing by zero (or taking the log of zero in the cross-entropy loss here) just won't help, because even then the number cannot be accurately represented at the precision you are using. (Ref: https://en.wikipedia.org/wiki/Round-off_error and https://floating-point-gui.de/basic/)
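For illustration, a rough sketch of where such an epsilon would go if you computed the loss by hand from the question's final_conv and y_ (the fused tf.nn.softmax_cross_entropy_with_logits op the question already uses handles this more stably internally, so this is only to make the idea concrete):

eps = 1e-7  # on the order of float32 machine epsilon

probs = tf.nn.softmax(final_conv)
# clip so tf.log never sees an exact 0
probs = tf.clip_by_value(probs, eps, 1.0 - eps)
manual_cross_entropy = tf.reduce_mean(
    -tf.reduce_sum(y_ * tf.log(probs), axis=2))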

Considerations:
1) When tweaking epsilons, be consistent with your data type: use the machine epsilon of the precision you are working in. For float32 that is roughly 1.19e-7 (ref: https://en.wikipedia.org/wiki/Machine_epsilon and the python numpy machine epsilon question); see the snippet after this list.

2) Just in case others reading this are confused: the value in the constructor for AdamOptimizer is the learning rate, but you can also set the epsilon value (ref: How does parameter epsilon affect AdamOptimizer? and https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer).

3) Numerical instability in TensorFlow is real and difficult to get around. Yes, there is tf.nn.softmax_cross_entropy_with_logits, but it is quite specific (what if you don't want a softmax?). Refer to Vahid Kazemi's 'Effective TensorFlow' for an insightful explanation: https://github.com/vahidk/EffectiveTensorflow#entropy
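For point 1, a quick way to check the machine epsilon of the dtype you are actually training in (these values are properties of the IEEE floating-point formats, not of this particular model):

import numpy as np

print(np.finfo(np.float32).eps)   # ~1.1920929e-07
print(np.finfo(np.float64).eps)   # ~2.220446e-16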

Phil P

That jump in your loss graph is very weird...

I would like you to focus on a few points:

  • if your images are not normalized between 0 and 1, then normalize them
  • if you have normalized your values between -1 and 1, then use a sigmoid layer instead of softmax, because softmax squashes the values between 0 and 1
  • before using softmax, add a sigmoid layer to squash your values (highly recommended)
  • another thing you can do is add dropout for every layer
  • I would also suggest using tf.clip_by_value or tf.clip_by_global_norm so that your gradients do not explode or implode (see the sketch after this list)
  • you can also use L2 regularization
  • and experiment with the learning rate and epsilon of AdamOptimizer
  • I would also suggest using TensorBoard to keep track of the weights, so you can see where they are exploding (see the sketch after this list)
  • you can also use TensorBoard for keeping track of loss and accuracy

  • See the softmax formula below:

softmax(x_i) = exp(x_i) / Σ_j exp(x_j)

  • Probably that e to the power of x: x is becoming a very large number, because of which softmax is giving infinity and hence the loss is infinity
  • Heavily use TensorBoard to debug and print the values of the softmax so that you can figure out where you are going wrong
  • One more thing I noticed: you are not using any kind of activation function after the convolution layers... I would suggest a leaky ReLU after every convolution layer
  • Your network is a humongous network, and it is important to use leaky ReLU as the activation function so that it adds non-linearity and hence improves performance
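A rough sketch of the clipping and TensorBoard points above, reusing cross_entropy from the question (the clip norm of 5.0 is an arbitrary illustrative value, not a recommendation):

optimizer = tf.train.AdamOptimizer(1e-5)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
grads, variables = zip(*grads_and_vars)
clipped_grads, _ = tf.clip_by_global_norm(grads, 5.0)  # keep gradients from exploding
train_step = optimizer.apply_gradients(list(zip(clipped_grads, variables)))

# TensorBoard: track the loss and every weight tensor
tf.summary.scalar('cross_entropy', cross_entropy)
for v in tf.trainable_variables():
    tf.summary.histogram(v.name.replace(':', '_'), v)
merged_summaries = tf.summary.merge_all()
# evaluate merged_summaries in sess.run(...) and write it out with a tf.summary.FileWriter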
Jai
  • My images are normalized between 0 and 1. I have played around with the learning rate and epsilon of adam optimizer, would you suggest a different one? The original paper does not use leaky relu after every convolution, so I am hesitant to do so. Thank you for the tensorboard recommendation, I am not familiar with it. – Devin Haslam Feb 05 '18 at 16:52
  • Did you use the same data ?... Try using a sigmoid layer before softmax.... Use tf.clip ... Use dropouts – Jai Feb 05 '18 at 17:08
  • I am new to tensorflow/tensorboard. Given the code that I provided, how would I print out my softmax values? I believe that softmax is calculated at the same time as cross entropy currently. If I want to print cross entropy, currently, I just use cross_entropy.eval() during the training. – Devin Haslam Feb 07 '18 at 15:42
  • you are right, softmax is calculated at the same time as cross entropy... But you can explicitly use `tf.nn.softmax()` as follows: `print(sess.run(tf.nn.softmax(logits)))` – Jai Feb 07 '18 at 19:29
  • Thank you for your help. I will print this out and come back. I decided to try using sigmoid instead of cross entropy, and I do not run into the original problem. I need to figure out how to implement both sigmoid and softmax at the same time and I believe that might fix my problem. – Devin Haslam Feb 09 '18 at 16:21
  • `logits = tf.nn.sigmoid(output_layer)` and then `softmax_with_cross_entropy(logits, targets)`.... If you find this answer helpful in solving your problem then do not forget to mark it as correct so that others come to know about it – Jai Feb 09 '18 at 20:28
  • I created a new question https://stackoverflow.com/questions/49016723/softmax-cross-entropy-loss-explodes – Devin Haslam Feb 27 '18 at 19:44

You may want to use a different value for epsilon in the Adam optimizer (e.g. 0.1 to 1.0). This is mentioned in the documentation:

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.
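Applied to the code in the question, that would look something like this (keeping the question's learning rate of 1e-5; epsilon=0.1 is just the value the documentation mentions, not a tuned choice):

train_step = tf.train.AdamOptimizer(learning_rate=1e-5,
                                    epsilon=0.1).minimize(cross_entropy)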

RobR
  • This is not an inception network. I am pretty confident a higher training rate is not the answer. – Devin Haslam Feb 03 '18 at 18:32
  • Epsilon isn't a training rate, it's a regularization factor. And the note gives inception as an example, not a specific requirement. – RobR Feb 03 '18 at 18:34
  • I'm sorry, I misunderstood. Can you offer an explanation for why epsilon would cause this problem? – Devin Haslam Feb 03 '18 at 18:36
  • I can explain it if the change is helpful :-). I don't know that this is the problem but a larger epsilon may help stabilize the adaptive learning rate in the event that the average squared gradient approaches zero. See the original paper linked in the TF documentation. Perhaps just try and see if it makes a difference. – RobR Feb 03 '18 at 18:46
  • RobR is likely correct. Just to be clear... AdamOptimizer has a few input arguments... learning_rate, beta1, beta2, EPSILON, etc. He means that you should mess with that 4th argument. In your code, you are specifying a value of 1e-5 for the learning_rate, but you are using the default epsilon. Try changing that epsilon. – bremen_matt Feb 03 '18 at 21:41
  • I just finished the new training with an epsilon of .1, and it did not work. Thanks for trying. – Devin Haslam Feb 04 '18 at 01:36
  • Sorry that wasn't the cause in your case -- often it is. Please post if you find the source of the problem. – RobR Feb 04 '18 at 14:07
  • I created a new question https://stackoverflow.com/questions/49016723/softmax-cross-entropy-loss-explodes – Devin Haslam Feb 27 '18 at 19:44