GradientTape with Keras returns 0

Question

I've tried using GradientTape with a Keras model (simplified) as follows:

import tensorflow as tf
tf.enable_eager_execution()

input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
output = tf.keras.layers.Dense(10, activation='softmax')(flat)
model = tf.keras.Model(input_, output)
model.compile(loss='categorical_crossentropy', optimizer='sgd')

import numpy as np
inp = tf.Variable(np.random.random((1,28,28)), dtype=tf.float32, name='input')
target = tf.constant([[1,0,0,0,0,0,0,0,0,0]], dtype=tf.float32)
with tf.GradientTape(persistent=True) as g:
    g.watch(inp)
    result = model(inp, training=False)

print(tf.reduce_max(tf.abs(g.gradient(result, inp))))

But for some random values of inp, the gradient is zero everywhere, and for the rest, the gradient magnitude is really small (<1e-7).

I've also tried this with an MNIST-trained 3-layer MLP and the results are the same, but trying it with a 1-layer linear model with no activation works.

What's going on here?

I don't think "What is going on here" is really a valid question. What gradient values are you expecting, and why do you think there is something wrong? — Dr. Snoopy, May 13 '20 at 10:20
Well, given a random input that the forward pass misclassifies, shouldn't the gradient be large enough that we can do SGD with a reasonable step size? That's the result I was expecting. Edit: I used to have a loss layer there, and in the process of debugging I was looking at intermediate values in the backprop chain. So my expectation was just something not too small. — kwkt, May 13 '20 at 11:26

score 4 · Accepted Answer · answered May 13 '20 at 11:19

You are computing gradients of a softmax output layer. When the target passed to g.gradient is not a scalar, the gradients of its elements are summed, so you are effectively differentiating the sum of the ten softmax outputs. Since softmax always sums to 1, that sum cannot change, so it makes sense that the gradient must be 0. The cases where you get small values > 0 are numerical hiccups, I presume.
When you remove the activation function, this limitation no longer holds: the outputs no longer sum to a constant, so the gradients can have magnitude > 0.
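
To see why the gradient of the summed softmax output vanishes, here is a minimal pure-Python sketch (no TensorFlow; the helper names `softmax` and `softmax_jacobian` and the example logits are made up for illustration). It builds the softmax Jacobian, dp_i/dz_j = p_i * (delta_ij - p_j), and adds it up over the outputs i, which is what differentiating the whole output vector amounts to when the target is summed.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(probs):
    """Return J with J[i][j] = d p_i / d z_j = p_i * (delta_ij - p_j)."""
    n = len(probs)
    return [[probs[i] * ((1.0 if i == j else 0.0) - probs[j]) for j in range(n)]
            for i in range(n)]

logits = [0.3, -1.2, 2.0, 0.5]   # arbitrary example values
probs = softmax(logits)
jac = softmax_jacobian(probs)
n = len(probs)

# Gradient of sum_i p_i with respect to each logit z_j: add up column j.
grad_of_sum = [sum(jac[i][j] for i in range(n)) for j in range(n)]

# Gradient of the single output p_0 with respect to each logit: row 0.
grad_of_p0 = jac[0]

print(grad_of_sum)  # every entry is zero up to floating-point rounding
print(grad_of_p0)   # nonzero entries
```

Each column of the Jacobian sums to p_j - p_j * sum(p) = 0, while a single row does not, which is why picking out one class gives a usable gradient.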

Are you trying to use gradient descent to construct inputs that result in a very large probability for a certain class (if not, disregard this...)? @jdehesa already included a way to do this via the loss function. Note that you can do it via the softmax as well, like so:

import tensorflow as tf
tf.enable_eager_execution()  # TF 1.x only; eager execution is the default in TF 2.x

input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
output = tf.keras.layers.Dense(10, activation='softmax')(flat)
model = tf.keras.Model(input_, output)
model.compile(loss='categorical_crossentropy', optimizer='sgd')

import numpy as np
inp = tf.Variable(np.random.random((1,28,28)), dtype=tf.float32, name='input')
with tf.GradientTape(persistent=True) as g:
    g.watch(inp)
    result = model(inp, training=False)[:,0]

print(tf.reduce_max(tf.abs(g.gradient(result, inp))))

Note that I grab only the results in column 0, corresponding to the first class (I removed target because it's not used). This will compute gradients only for the softmax value for this class, which are meaningful.

Some caveats:

It's important to do the indexing inside the gradient tape context manager! If you do it outside (e.g. in the line where you call g.gradient), this will not work (no gradients).
You can also use gradients of the logits (pre-softmax values) instead. This is different, because softmax probabilities can be increased by making other classes less likely, whereas logits can only be increased by increasing the "score" for the class in question.
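
The difference between the two targets in the second caveat can be made concrete with a small pure-Python sketch (the `softmax` helper and the logit values are illustrative, not taken from the code above). It lowers a competing class's logit and shows that the class-0 probability rises even though the class-0 logit itself is untouched.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 0.5]          # illustrative values; class 0 is the target
p0_before = softmax(logits)[0]

# Push a competing class down without touching the target's own logit.
lowered = [logits[0], logits[1] - 1.0, logits[2]]
p0_after = softmax(lowered)[0]

print(p0_before, p0_after)        # the class-0 probability rises
print(logits[0], lowered[0])      # the class-0 logit is unchanged

# Analytic gradient of p_0 with respect to each logit: p_0 * (delta_0j - p_j).
probs = softmax(logits)
grad_p0 = [probs[0] * ((1.0 if j == 0 else 0.0) - probs[j])
           for j in range(len(logits))]
print(grad_p0)                    # positive for j = 0, negative for the others
```

So following the gradient of the probability can make progress by suppressing other classes, whereas the gradient of the logit only rewards raising the class's own score.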

Thank you for the tips on TF gradients! This question is rather misformed on my end, since I tried simplifying the problem I had and failed. If you will, please check my comment on @jdehesa's answer. — kwkt, May 13 '20 at 14:15

score 2 · Answer 2 · answered May 13 '20 at 10:13

Computing the gradients against the output of the model is not usually very meaningful. In general you compute the gradients against the loss, which is what tells the model where the variables should go to reach your goal. In this case, you would be optimizing your input instead of the model parameters, but the idea is the same.

import tensorflow as tf
import numpy as np
tf.enable_eager_execution()  # Not necessary in TF 2.x

tf.random.set_random_seed(0)  # tf.random.set_seed in TF 2.x
np.random.seed(0)
input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
output = tf.keras.layers.Dense(10, activation='softmax')(flat)
model = tf.keras.Model(input_, output)
model.compile(loss='categorical_crossentropy', optimizer='sgd')

inp = tf.Variable(np.random.random((1, 28, 28)), dtype=tf.float32, name='input')
target = tf.constant([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=tf.float32)
with tf.GradientTape(persistent=True) as g:
    g.watch(inp)
    result = model(inp, training=False)
    # Get the loss for the example
    loss = tf.keras.losses.categorical_crossentropy(target, result)

print(tf.reduce_max(tf.abs(g.gradient(loss, inp))))
# tf.Tensor(0.118953675, shape=(), dtype=float32)
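
One way to see why the loss gradient does not vanish the way the summed output gradient does: for softmax followed by categorical cross-entropy, the gradient with respect to the logits is `probs - target`, which is nonzero unless the prediction already matches the target exactly. Below is a minimal pure-Python check against finite differences (the helper names and values are illustrative and not part of the script above).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Categorical cross-entropy of softmax(logits) against a one-hot target."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs))

logits = [0.2, 1.5, -0.7]        # illustrative values
target = [1.0, 0.0, 0.0]         # one-hot label for class 0

# Analytic gradient of the loss with respect to the logits.
analytic = [p - t for p, t in zip(softmax(logits), target)]

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for j in range(len(logits)):
    up = list(logits)
    up[j] += eps
    down = list(logits)
    down[j] -= eps
    diff = cross_entropy(up, target) - cross_entropy(down, target)
    numeric.append(diff / (2 * eps))

print(analytic)
print(numeric)   # agrees with the analytic gradient to within finite-difference error
```

For a single dense layer like the one above, the gradient with respect to the input is this vector propagated back through the layer's weights, so it is generally nonzero as long as the prediction differs from the target.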

jdehesa

58,456
7
77
121

Thank you, I was in the middle of simplifying my code to get a minimum example and forgot to retry the gradient of the loss. However, the problem persists when I tried this with a 3-layer MLP (784 -> 300 -> 100 -> 10, all ReLU except for softmax at the end). I wonder if this deserves a question of its own. – kwkt May 13 '20 at 14:13
@kwkt I just tried that configuration (784 input layer, 300 relu layer, 100 relu layer, 10 softmax output layer) in the script above and the printed value was `0.10540651`. – jdehesa May 13 '20 at 15:04
Untrained, yes. However, with [this](https://filebin.net/vs7351q5zqbearl0/digits.h5) particular model save that I had after training on MNIST for 100 epochs, it gives 0 most of the time. – kwkt May 13 '20 at 15:16
@kwkt That is most likely because you are at a local minimum, or close to it, which is what usually happens after training for a while. It means that changing the variable in any direction will not reduce the loss value, and it corresponds to the point where the loss plot becomes flat. You could think of the [vanishing gradient problem](https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484), but that shouldn't happen in a shallow model with ReLU activation. – jdehesa May 13 '20 at 15:32
But in my case I'm trying to do gradient descent to create an adversarial data point, which means that there has to be some direction in which the loss decreases, unless the loss surface is really flat in some epsilon-ball. – kwkt May 14 '20 at 00:47
@kwkt Ah I see what you mean, so you are getting near-zero gradients when you give an input and compute the loss against a different class to what it should have? Yes that should work, I'm not sure what could be the issue then, although I don't know so much about adversarial examples. If you haven't already, you may look into [CleverHans](https://github.com/tensorflow/cleverhans), which is a framework for that specifically. – jdehesa May 14 '20 at 09:06
I actually was using CleverHans as an alternative, and it works (yay). However I'm trying to figure out how to code it myself, hence the question. Worst case scenario, I'll have to dive into CleverHans' source code. – kwkt May 14 '20 at 09:55
