It seems like I have an exploding gradient issue during the training of my reinforcement learning policy. However, I'm using a gradient clipping by norm with 0.2 as the clipping factor.
I've check both my inputs and my loss and none of them are NAN. Only my gradients face an issue.
All of the gradients without exception becomes Nan in only 1 step and I don't understand how it is possible since I'm clipping it. Shouldn't tensorflow transform the nan gradients into a clipped vector ?
Here is the input data when the nan gradients appear :
INPUT : [0.1, 0.0035909, 0.06, 0.00128137, 0.6, 0.71428571, 0.81645947, 0.46802986, 0.04861736, 0.01430704, 0.08, 0.08966659, 0.02, 0.]
Here are the 10 previous loss value (last value being the one when the gradients become NaN)
[-0.0015171316, -0.0015835371, 0.0002261286, 0.0003917102, -0.0024305983, -0.0054471847, 0.00082066684, 0.0038477872, 0.012144111]
Here is the network I'm using, hiddens_dims
is a list containing the number of nodes of the consecutive Dense layers (I'm dynamically making those layers) :
class NeuralNet(tf.keras.Model):
def __init__(self, hiddens_dim = [4,4] ):
self.hidden_layers = [tf.keras.layers.Dense(hidden_dim,
activation= 'elu',
kernel_initializer= tf.keras.initializers.VarianceScaling(),
kernel_regularizer= tf.keras.regularizers.L1(l1= 0.001),
name= f'hidden_{i}')
for i,hidden_dim in enumerate(hiddens_dim)
]
# Output layers
self.output_layer = tf.keras.layers.Dense(self.out_dim,
activation= 'softmax',
kernel_initializer= tf.keras.initializers.GlorotNormal(),
name= 'output')
def call(self, input):
x = input
for layer in self.hidden_layers :
x = layer(x)
output = self.output_layer(x)
return output
Now here is the part where I update the gradient manually :
model = NeuralNet([4,4])
optim = tf.keras.optimizers.Adam(learning_rate= 0.01)
...
with tf.GradientTape() as tape :
loss = compute_loss(rewards, log_probs)
grads = tape.gradient(loss, self.model.trainable_variables)
grads = [(tf.clip_by_norm(grad, clip_norm=self.clip)) for grad in grads]
optim.apply_gradients( zip(grads, self.model.trainable_variables) )
And Finally, here are the gradients in the previous iteration, right before the catastrophe :
Gradient Hidden Layer 1 : [
[-0.00839788, 0.00738428, 0.0006091 , 0.00240378],
[-0.00171666, 0.00157034, 0.00012367, 0.00051114],
[-0.0069742 , 0.00618575, 0.00050313, 0.00201353],
[-0.00263796, 0.00235524, 0.00018991, 0.00076653],
[-0.01119559, 0.01178695, 0.0007518 , 0.00383774],
[-0.08692611, 0.07620181, 0.00630627, 0.02480747],
[-0.10398869, 0.09012008, 0.00754619, 0.02933704],
[-0.04725896, 0.04004722, 0.00343443, 0.01303552],
[-0.00493888, 0.0043246 , 0.00035772, 0.00140733],
[-0.00559061, 0.00484629, 0.00040546, 0.00157689],
[-0.00595227, 0.00524359, 0.00042967, 0.00170693],
[-0.02488269, 0.02446024, 0.00177054, 0.00796351],
[-0.00850916, 0.00703857, 0.00062265, 0.00229139],
[-0.00220688, 0.00196331, 0.0001586 , 0.0006386 ]]
Gradient Hidden Layer 2 : [
[-2.6317715e-04, -2.1482834e-04, 3.0761934e-04, 3.1322116e-04],
[ 8.4564053e-03, 6.7548533e-03, -9.8721031e-03, -1.0047102e-02],
[-3.8322039e-05, -3.1298561e-05, 4.3669730e-05, 4.4472294e-05],
[ 3.6933038e-03, 2.9515910e-03, -4.3102605e-03, -4.3875999e-03]]
Gradient Output Layer :
[-0.0011955 , 0.0011955 ],
[-0.00074397, 0.00074397],
[-0.0001833 , 0.0001833 ],
[-0.00018749, 0.00018749]]
I'm not very familiar with tensorflow so maybe I'm not training the model correctly ? However, the model seemed to train correctly before the gradients become crazy.
I know I can use many other methods to counter exploding gradient (batch norm, dropout, decrease the learning rate etc) but I want to understand why gradient clipping is not working here ? I thought that gradient can't explode when we clip it by definition
Thank you