I have set up a very simple multi-layer perceptron with a single hidden layer using a sigmoid transfer function, and mock data with 2 inputs.
I based it on the Simple Feedforward Neural Network using TensorFlow example on GitHub. I won't post the whole thing here, but my cost function is set up like this:
import numpy
import tensorflow

# Backward propagation
loss = tensorflow.losses.mean_squared_error(labels=y, predictions=yhat)
cost = tensorflow.reduce_mean(loss, name='cost')
updates = tensorflow.train.GradientDescentOptimizer(0.01).minimize(cost)
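For what it's worth, my understanding is that tensorflow.losses.mean_squared_error already returns the mean over the batch, so the reduce_mean should be a no-op here; a minimal sketch of what I believe the cost reduces to (assuming y and yhat both have shape [None, 1]):

# what I believe the cost above amounts to: the mean of the squared
# differences over the batch (cost_manual is just a name for this check)
cost_manual = tensorflow.reduce_mean(tensorflow.square(y - yhat))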
Then I simply loop through a bunch of epochs, the intention being that my weights are optimised via the updates operation at every step:
with tensorflow.Session() as sess:
    init = tensorflow.global_variables_initializer()
    sess.run(init)

    for epoch in range(10):
        # Train with each example
        for i in range(len(train_X)):
            feed_dict = {X: train_X[i: i + 1], y: train_y[i: i + 1]}
            res = sess.run([updates, loss], feed_dict)
            print("epoch {}, step {}. w_1: {}, loss: {}".format(epoch, i, w_1.eval(), res[1]))

        train_result = sess.run(predict, feed_dict={X: train_X, y: train_y})
        train_errors = abs((train_y - train_result) / train_y)
        train_mean_error = numpy.mean(train_errors, axis=1)

        test_result = sess.run(predict, feed_dict={X: test_X, y: test_y})
        test_errors = abs((test_y - test_result) / test_y)
        test_mean_error = numpy.mean(test_errors, axis=1)

        print("Epoch = %d, train error = %.5f%%, test error = %.5f%%"
              % (epoch, 100. * train_mean_error[0], 100. * test_mean_error[0]))

    sess.close()  # redundant inside the with block, but harmless
I would expect the output of this program to show that, at each epoch and for each step, the weights are updated, with a loss value that broadly decreases over time.
However, while I see the loss value and errors decreasing, the weights change only once, after the first step, and then remain fixed for the remainder of the program.
What is going on here?
Here is what is printed to screen during the first 2 epochs:
epoch 0, step 0. w_1: [[0. 0.]
[0. 0.]], loss: 492.525634766
epoch 0, step 1. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], loss: 482.724365234
epoch 0, step 2. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], loss: 454.100799561
epoch 0, step 3. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], loss: 418.499267578
epoch 0, step 4. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], loss: 387.509033203
Epoch = 0, train error = 84.78731%, test error = 88.31780%
epoch 1, step 0. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], loss: 355.381134033
epoch 1, step 1. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], loss: 327.519226074
epoch 1, step 2. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], loss: 301.841705322
epoch 1, step 3. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], loss: 278.177368164
epoch 1, step 4. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], loss: 257.852508545
Epoch = 1, train error = 69.24779%, test error = 76.38461%
Besides the weights not changing, it's also interesting that both values within each row of w_1 are identical. The loss itself keeps decreasing. Here is what the last epoch looks like:
epoch 9, step 0. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], loss: 13.5048065186
epoch 9, step 1. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], loss: 12.4460296631
epoch 9, step 2. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], loss: 11.4702644348
epoch 9, step 3. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], loss: 10.5709943771
epoch 9, step 4. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], loss: 10.0332946777
Epoch = 9, train error = 13.49328%, test error = 33.56935%
What am I doing incorrectly here? I know that the weights are being updated somewhere because I can see the training and test errors changing, but why can't I see this in the printed values?
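In case eval() is somehow reading a stale value, a variation I could try is to fetch the weights in the same run call as the update; a minimal sketch (the extra fetch is the only change from the loop above):

# fetch w_1 alongside the update op instead of calling eval() afterwards;
# note: a value fetched in the same run as the update may reflect the state
# either before or after the update is applied
res = sess.run([updates, loss, w_1], feed_dict)
print("epoch {}, step {}. w_1: {}, loss: {}".format(epoch, i, res[2], res[1]))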
EDIT: As per squadrick's request, here is the code for w_1 and yhat:
# Layer's sizes
x_size = train_X.shape[1] # Number of input nodes
y_size = train_y.shape[1] # Number of outcomes
# Symbols
X = tensorflow.placeholder("float", shape=[None, x_size], name='X')
y = tensorflow.placeholder("float", shape=[None, y_size], name='y')
# Weight initializations
w_1 = tensorflow.Variable(tensorflow.zeros((x_size, x_size)))
w_2 = tensorflow.Variable(tensorflow.zeros((x_size, y_size)))
# Forward propagation
h = tensorflow.nn.sigmoid(tensorflow.matmul(X, w_1))
yhat = tensorflow.matmul(h, w_2)
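One thing worth noting about this setup: since w_1 starts at all zeros, every hidden activation at initialisation should be sigmoid(0) = 0.5, regardless of the input; a quick numpy check:

# sigmoid(0) = 1 / (1 + e^0) = 0.5, so with w_1 all zeros each hidden
# unit outputs 0.5 for any input
print(1.0 / (1.0 + numpy.exp(-0.0)))  # 0.5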
EDIT2: squadrick's suggestion to look at w_2 is interesting; when I add w_2 to the print statement with the following:
print("epoch {}, step {}. w_1: {}, w_2: {}, loss: {}".format(epoch, i, w_1.eval(), w_2.eval(), res[1]))
I see that it does actually update:
epoch 0, step 0. w_1: [[0. 0.]
[0. 0.]], w_2: [[0.22192918]
[0.22192918]], loss: 492.525634766
epoch 0, step 1. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], w_2: [[0.44163907]
[0.44163907]], loss: 482.724365234
epoch 0, step 2. w_1: [[0.5410637 0.5410637]
[0.5803371 0.5803371]], w_2: [[0.8678319]
[0.8678319]], loss: 454.100799561
So now it looks like the issue is that only w_2 is being updated, not w_1. I'm still not sure why this would be happening, though.
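To dig further, one thing I could check is whether any gradient actually reaches w_1; a minimal sketch using tf.gradients, assuming the graph above (grad_w1 and grad_w2 are just names I've picked for this check):

# gradients of the cost with respect to each weight matrix
grad_w1, grad_w2 = tensorflow.gradients(cost, [w_1, w_2])

# inside the training loop, evaluated with the same feed_dict as the update
g1, g2 = sess.run([grad_w1, grad_w2], feed_dict)
print("grad w_1: {}, grad w_2: {}".format(g1, g2))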