I've tried using GradientTape
with a Keras model (simplified) as follows:
import numpy as np
import tensorflow as tf

tf.enable_eager_execution()

# Simple model: flatten a 28x28 input, then a single softmax Dense layer.
input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
output = tf.keras.layers.Dense(10, activation='softmax')(flat)
model = tf.keras.Model(input_, output)
model.compile(loss='categorical_crossentropy', optimizer='sgd')

# Random input, explicitly watched so I can take gradients w.r.t. it.
inp = tf.Variable(np.random.random((1, 28, 28)), dtype=tf.float32, name='input')
target = tf.constant([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=tf.float32)

with tf.GradientTape(persistent=True) as g:
    g.watch(inp)
    result = model(inp, training=False)

print(tf.reduce_max(tf.abs(g.gradient(result, inp))))
But for some random values of inp, the gradient is zero everywhere, and for the rest the gradient magnitude is really small (< 1e-7).
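Roughly how I'm checking this (a sketch, not my exact script): I repeat the gradient computation for a handful of fresh random inputs and print the largest absolute entry of the gradient each time.

# Sketch: repeat the check above for several random inputs.
for _ in range(5):
    inp = tf.Variable(np.random.random((1, 28, 28)), dtype=tf.float32)
    with tf.GradientTape() as g:
        g.watch(inp)
        result = model(inp, training=False)
    grad = g.gradient(result, inp)
    # Often prints exactly 0.0; otherwise something below 1e-7.
    print(float(tf.reduce_max(tf.abs(grad))))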
I've also tried this with a MNIST-trained 3-layer MLP and the results are the same, but with a single-layer linear model (no activation) I do get sensible, non-zero gradients.
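For reference, this is roughly what I mean by the single-layer linear model (a sketch; it reuses the flat tensor from above and simply drops the softmax):

linear_out = tf.keras.layers.Dense(10)(flat)  # same Dense layer size, but no activation
linear_model = tf.keras.Model(input_, linear_out)

with tf.GradientTape() as g2:
    g2.watch(inp)
    linear_result = linear_model(inp, training=False)
# Here the maximum absolute gradient is clearly non-zero.
print(tf.reduce_max(tf.abs(g2.gradient(linear_result, inp))))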
What's going on here?