
So I have a very simple NN script written in TensorFlow, and I am having a hard time tracking down where some "randomness" is coming from.

I have recorded the

  • Weights,
  • Gradients,
  • Logits

of my network as I train, and for the first iteration it is clear that everything starts off the same. I have one SEED value for how the data is read in, and another SEED value for initializing the weights of the net. Those I never change.
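
For concreteness, here is a minimal sketch of the kind of seeding I mean (the names and ops are illustrative, not my exact code):

    import tensorflow as tf

    DATA_SEED = 42    # illustrative seed for how the data is read in
    INIT_SEED = 1234  # illustrative seed for weight initialization

    filenames = ["train.tfrecords"]  # placeholder input files
    # Seed the input pipeline so records are shuffled the same way each run.
    filename_queue = tf.train.string_input_producer(filenames, seed=DATA_SEED)

    # Seed the initializer so the net starts from the same weights each run.
    w = tf.get_variable(
        "w", shape=[784, 10],
        initializer=tf.truncated_normal_initializer(stddev=0.1, seed=INIT_SEED))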

My problem is that on, say, the second iteration of every re-run I do, the gradients start to diverge (by a small amount, around 1e-6 or so). Over time, this of course leads to non-repeatable behaviour.

What might the cause of this be? I don't know where any possible source of randomness could be coming from...

Thanks

Spacey
  • Do you use a GPU? Various ops on the GPU are non-deterministic due to their use of CUDA atomics (like tf.reduce_sum). – Yaroslav Bulatov Oct 08 '16 at 23:16
  • Also, there are some SSE optimizations that produce non-deterministic results; you could try compiling TensorFlow without any optimizations to see if that's the case (details: http://blog.nag.com/2011/02/wandering-precision.html) – Yaroslav Bulatov Oct 08 '16 at 23:17
  • Hi @YaroslavBulatov yes, I am indeed using a GPU. – Spacey Oct 08 '16 at 23:20
  • @YaroslavBulatov Interesting about the optimizations... and also about the GPU part. Does this mean that, whether on CPU or GPU, we can always expect this kind of behaviour? How, then, can we hope to get truly deterministic results in TF?... – Spacey Oct 08 '16 at 23:26
  • I have a similar issue, see: http://stackoverflow.com/questions/42412660/non-deterministic-gradient-computation – Georg Feb 23 '17 at 10:13

3 Answers


There's a good chance you could get deterministic results if you run your network on the CPU (export CUDA_VISIBLE_DEVICES=), with a single thread in the Eigen thread pool (tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=1))), one Python thread (no multi-threaded queue runners of the kind you get from ops like tf.train.batch), and a single well-defined operation order. Setting inter_op_parallelism_threads=1 may also help in some scenarios.
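
For example, something along these lines should pin everything to the CPU and a single Eigen thread (a minimal sketch, TF 1.x graph/session style; the model itself is assumed to be built elsewhere in your script):

    import os
    # Hide all GPUs before TensorFlow initializes, so every op runs on the CPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

    import tensorflow as tf

    config = tf.ConfigProto(
        intra_op_parallelism_threads=1,   # single thread inside each op (Eigen pool)
        inter_op_parallelism_threads=1)   # single thread scheduling ops
    sess = tf.Session(config=config)
    # ... build the graph and run training steps with `sess` as usual ...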

One issue is that floating-point addition and multiplication are not associative, so one foolproof way to get deterministic results is to use integer arithmetic or quantized values.
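
A quick plain-Python illustration of the non-associativity (nothing TF-specific):

    # Summing the same three numbers in a different order gives a
    # different floating-point result.
    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c)                   # 0.6000000000000001
    print(a + (b + c))                   # 0.6
    print((a + b) + c == a + (b + c))    # False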

Barring that, you could isolate which operation is non-deterministic and try to avoid using that op. For instance, there's the tf.add_n op, which doesn't guarantee the order in which it sums its inputs, and different orders produce different results.
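
One way to isolate such an op is to feed it the exact same input twice and compare the results bit-for-bit. A sketch of that check (names are illustrative), using a large reduction as the candidate op:

    import numpy as np
    import tensorflow as tf

    x = tf.placeholder(tf.float32, shape=[10000000])
    s = tf.reduce_sum(x)   # candidate op to test for determinism

    data = np.random.RandomState(0).rand(10000000).astype(np.float32)
    with tf.Session() as sess:
        r1 = sess.run(s, feed_dict={x: data})
        r2 = sess.run(s, feed_dict={x: data})
        # On a GPU this comparison may come out False; in a single-threaded
        # CPU session it should be True.
        print(r1 == r2)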

Getting deterministic results is a bit of an uphill battle because determinism is in conflict with performance, and performance is usually the goal that gets more attention. An alternative to trying to get the exact same numbers on re-runs is to focus on numerical stability -- if your algorithm is stable, then you will get reproducible results (i.e., the same number of misclassifications) even though the exact parameter values may be slightly different.

Yaroslav Bulatov
  • (1/2) Thanks Yaroslav, a couple of things: 1) Is there an easy way to force TF to just use the CPU? (Could you expand on export CUDA_VISIBLE_DEVICES= somewhat? Should I just type that verbatim into the command line?) 2) Regarding the integer/floating-point values: are you saying that one experiment I can do is change all my parameters (and related values) to tf.int16, for example, instead of tf.float32 as they are now, to try to get reproducibility, since integer arithmetic will not suffer from the same floating-point issues you highlighted? – Spacey Oct 09 '16 at 01:03
  • (2/2) On the reproducibility, yes, I wanted to try to get this because of a bug I am trying to get to the bottom of. Basically, my (data) loss explodes to very high values (sometimes even a NaN) as my training proceeds. However, this only seems to happen once the (data) loss has reached extremely low values to begin with. Sometimes the net recovers, but sometimes not, so this is actually the main problem. :-/ The weird thing is that I am using all TF functions, and the graph is even a skeletonized version of (https://www.tensorflow.org/versions/r0.11/tutorials/deep_cnn/index.html). – Spacey Oct 09 '16 at 01:07
  • "Loss exploding" is a common phenomenon, and it's a property of stochastic gradient descent. The common solution is to lower the learning rate and/or add regularization. – Yaroslav Bulatov Oct 09 '16 at 19:40
  • The weird thing is that this "exploding loss" seems to happen when the (data) loss itself is really close to 0 (softmax loss, btw), and it doesn't seem to happen that much when the loss isn't. Is this what you mean? Lastly, are there any good papers you might recommend about why this phenomenon exists? Thanks Yaroslav! – Spacey Oct 09 '16 at 19:46
  • Maybe there's a denominator going to zero somewhere? I.e., if you are doing logistic regression and your data becomes perfectly classified, you will get an explosion to infinity. Adding L2 regularization on the parameters fixes this. – Yaroslav Bulatov Oct 09 '16 at 19:50
  • That was my suspicion, but the thing is I am using one of TF's own examples! (tensorflow.org/versions/r0.11/tutorials/deep_cnn/index.html), and I haven't changed anything in their losses, etc. :-/ – Spacey Oct 09 '16 at 20:15
  • Ah! Wow! OK - thanks!! I don't feel that crazy anymore. :) – Spacey Oct 09 '16 at 20:18

The TensorFlow reduce_sum op is specifically known to be non-deterministic on GPU. Furthermore, reduce_sum is used for calculating bias gradients.

This post discusses a workaround that avoids using reduce_sum (i.e., taking the dot product of any vector with a vector of all 1's is the same as reduce_sum over that vector).
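
A sketch of that workaround in TF 1.x terms (variable names are illustrative): replace the reduction with a matmul against a vector of ones, which goes through a GEMM kernel rather than an atomics-based reduction.

    import tensorflow as tf

    v = tf.placeholder(tf.float32, shape=[None, 1])   # column vector to sum

    # Potentially non-deterministic on GPU:
    plain_sum = tf.reduce_sum(v)

    # Workaround: dot product with a vector of all 1's gives the same sum.
    ones = tf.ones_like(v)
    dot_sum = tf.matmul(v, ones, transpose_a=True)    # shape [1, 1]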

DankMasterDan

I have faced the same problem. The working solution for me was to:

1- Use tf.set_random_seed(1) so that all TF random ops are seeded the same way on every new run.

2- Train the model on the CPU rather than the GPU, to avoid non-deterministic GPU operations (see the sketch below).
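
A minimal sketch of these two steps together (TF 1.x graph/session style):

    import tensorflow as tf

    tf.set_random_seed(1)          # step 1: graph-level seed for all random ops

    w = tf.random_normal([2, 2])   # example op that draws from the seeded RNG

    # Step 2: hide the GPU so every op runs on the CPU.
    config = tf.ConfigProto(device_count={'GPU': 0})
    with tf.Session(config=config) as sess:
        print(sess.run(w))         # should be identical across re-runs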