
I realized that my models end up being different every time I train them, even though I keep the TensorFlow random seed the same.

I verified that:

  • Initialization is deterministic; the weights are identical before the first update.
  • Inputs are deterministic. In fact, various forward computations, including the loss, are identical for the very first batch.
  • The gradients for the first batch differ between runs. Concretely, I'm comparing the outputs of tf.gradients(loss, train_variables). While loss and train_variables have identical values, the gradients sometimes differ for some of the variables, and the differences are substantial (the sum of absolute differences for a single variable's gradient is sometimes greater than 1). A minimal sketch of the comparison follows this list.
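
Roughly, this is the kind of comparison I'm running (a toy graph against the TF 1.x API, not my actual model; all names here are made up for illustration):

```python
import numpy as np
import tensorflow as tf

def run_once(seed=42):
    """Build a tiny graph, compute the first-batch gradients, return numpy values."""
    tf.reset_default_graph()
    tf.set_random_seed(seed)

    # Fixed inputs so the forward pass is identical across runs.
    x = tf.constant(np.random.RandomState(0).rand(8, 4), dtype=tf.float32)
    y = tf.constant(np.random.RandomState(1).rand(8, 1), dtype=tf.float32)

    w = tf.get_variable("w", shape=[4, 1])
    b = tf.get_variable("b", shape=[1], initializer=tf.zeros_initializer())
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))

    grads = tf.gradients(loss, tf.trainable_variables())
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        return sess.run([loss] + grads)

run1, run2 = run_once(), run_once()
print("loss diff:", abs(run1[0] - run2[0]))
for g1, g2 in zip(run1[1:], run2[1:]):
    print("grad sum-of-abs-diff:", np.abs(g1 - g2).sum())
```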

I conclude that it's the gradient computation that causes the non-determinism. I had a look at this question, and the problem persists even when running on a CPU with intra_op_parallelism_threads=1 and inter_op_parallelism_threads=1.
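
For reference, this is how I'm pinning execution to a single CPU thread (standard ConfigProto options; sketch only):

```python
import tensorflow as tf

# Single-threaded CPU execution, so op scheduling can't reorder
# floating-point reductions between runs.
config = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1,
    device_count={"GPU": 0},  # keep everything on the CPU
)

sess = tf.Session(config=config)
```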

How can the backward pass be non-deterministic when the forward pass isn't? How could I debug this further?

Georg

1 Answer


This answer might seem a little obvious, but do you use some kind of non-deterministic regularization, such as dropout? Since dropout randomly "drops" some connections during training, it may be causing that difference in the gradients.
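
As a quick check (a sketch only, not necessarily how your model wires it up), you could pass an operation-level seed to dropout, or set keep_prob to 1.0 to rule dropout out entirely:

```python
import tensorflow as tf

x = tf.ones([4, 4])

# Operation-level seed: together with the graph-level seed from
# tf.set_random_seed(), this makes the sequence of dropout masks
# reproducible across separate runs of the program.
dropped = tf.nn.dropout(x, keep_prob=0.5, seed=1234)

# Or disable dropout entirely while debugging determinism:
not_dropped = tf.nn.dropout(x, keep_prob=1.0)

with tf.Session() as sess:
    print(sess.run(dropped))
    print(sess.run(not_dropped))
```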

Edit: Similar questions:

Edit 2: This seems to be an issue with TensorFlow's implementation. See the following open issues in GitHub:

jabalazs
  • If that was the case, forward computation would be different too, right? Also, shouldn't the (static) seed determine which connections are dropped? – Georg Feb 23 '17 at 12:24
  • Do you mean forward computation while training or when evaluating? The usual practice is to disable dropout when validating. How is it implemented in your code? And yes, the random seed should determine which connections are dropped. Perhaps you're having similar problems to those mentioned in [this](http://stackoverflow.com/a/36289575/3941813) question. – jabalazs Feb 23 '17 at 12:29
  • I mean forward computation while training. I use `tf.nn.dropout()`, but I just checked, even with `keep_prob == 1` the issue persists. I also checked that only one graph is instantiated (`tf.Graph()` is actually never called in the code). – Georg Feb 23 '17 at 16:13
  • Could you please provide the rest of your code so we can better diagnose the issue? – jabalazs Feb 23 '17 at 23:56
  • I'm afraid this is a huge model that I developed for my master thesis. We're looking at several hundred LOC, distributed over several classes. I don't think it would help... – Georg Feb 24 '17 at 14:21