TensorFlow: gradients become NaN even when I clip them

Question

It seems I have an exploding-gradient issue while training my reinforcement learning policy. However, I'm clipping the gradients by norm, with 0.2 as the clipping norm.

I've checked both my inputs and my loss, and none of them is NaN. Only the gradients are affected.

Every gradient, without exception, becomes NaN within a single step, and I don't understand how that is possible since I'm clipping them. Shouldn't TensorFlow turn the NaN gradients into a clipped vector?
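
As a sanity check on that expectation, here is a minimal plain-Python sketch of what clipping by norm computes: the values are rescaled by `clip_norm / l2_norm` when the norm exceeds the threshold. It is not TensorFlow's implementation, and the function name `clip_by_norm` below is only a stand-in that operates on a flat list of floats. Under those assumptions, it suggests that rescaling cannot turn a NaN entry back into a finite number, and that an infinite entry turns into NaN once it is divided by an infinite norm.

```python
import math

def clip_by_norm(values, clip_norm):
    """Plain-Python sketch: rescale `values` so its L2 norm is at most `clip_norm`."""
    l2_norm = math.sqrt(sum(v * v for v in values))
    # Rescale only when the norm exceeds the threshold. A NaN norm fails
    # this comparison, so the values (including the NaN) come back unchanged.
    if l2_norm > clip_norm:
        return [v * clip_norm / l2_norm for v in values]
    return list(values)

# Finite gradient: norm 5.0 is rescaled down to norm 0.2.
finite = clip_by_norm([3.0, 4.0], clip_norm=0.2)

# Gradient that already contains NaN: the NaN entry survives clipping.
with_nan = clip_by_norm([3.0, float("nan")], clip_norm=0.2)

# Gradient that contains inf: the norm is inf, and inf * 0.2 / inf is NaN.
with_inf = clip_by_norm([3.0, float("inf")], clip_norm=0.2)

print(finite)
print(with_nan)
print(with_inf)
```

If this sketch reflects the real behaviour, clipping only bounds gradients that are large but finite, and the NaN would have to originate before the clipping step.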

Here is the input data at the step where the NaN gradients appear:

INPUT : [0.1, 0.0035909, 0.06, 0.00128137, 0.6, 0.71428571, 0.81645947, 0.46802986, 0.04861736, 0.01430704, 0.08, 0.08966659, 0.02, 0.]

Here are the most recent loss values (the last one is from the step where the gradients become NaN):

[-0.0015171316, -0.0015835371, 0.0002261286, 0.0003917102, -0.0024305983, -0.0054471847, 0.00082066684, 0.0038477872, 0.012144111]

Here is the network I'm using. hiddens_dim is a list containing the number of nodes of the consecutive Dense layers (I create those layers dynamically):


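import tensorflow as tf


class NeuralNet(tf.keras.Model):

    def __init__(self, hiddens_dim = [4,4], out_dim = 2):
        # Keras needs the base initializer to run before layers are assigned.
        super().__init__()
        # out_dim is an assumption: the original snippet does not show where it is set.
        self.out_dim = out_dim

        # Hidden layers: one Dense layer per entry in hiddens_dim
        self.hidden_layers = [tf.keras.layers.Dense(hidden_dim, 
                                                    activation= 'elu', 
                                                    kernel_initializer= tf.keras.initializers.VarianceScaling(),
                                                    kernel_regularizer= tf.keras.regularizers.L1(l1= 0.001),
                                                    name= f'hidden_{i}') 
                                                    for i,hidden_dim in enumerate(hiddens_dim)
                             ]

        # Output layer: softmax over the out_dim outputs
        self.output_layer = tf.keras.layers.Dense(self.out_dim, 
                                                    activation= 'softmax', 
                                                    kernel_initializer= tf.keras.initializers.GlorotNormal(),
                                                    name= 'output')


    def call(self, inputs):
        # Pass the inputs through each hidden layer in order, then the output layer.
        x = inputs
        for layer in self.hidden_layers :
            x = layer(x)
        output = self.output_layer(x)

        return output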

Here is the part where I compute and apply the gradients manually:

                model = NeuralNet([4,4])
                optim = tf.keras.optimizers.Adam(learning_rate= 0.01)
                
                ...

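                # Record the loss computation so it can be differentiated.
                with tf.GradientTape() as tape :
                    loss = compute_loss(rewards, log_probs)
                # One gradient tensor per trainable variable.
                grads = tape.gradient(loss, self.model.trainable_variables)
                # Clip each gradient tensor separately to an L2 norm of at most self.clip.
                grads = [(tf.clip_by_norm(grad, clip_norm=self.clip)) for grad in grads]
                optim.apply_gradients( zip(grads, self.model.trainable_variables) )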
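
One way to narrow this down would be to inspect the raw gradients before they are clipped and find the first variable whose gradient is non-finite. The helper below is a framework-agnostic sketch under that assumption: `first_non_finite` is a hypothetical name, and it works on nested Python lists (for example obtained with `grad.numpy().tolist()` on an eager tensor), not on tensors directly. The sample values are made up for illustration. In TensorFlow itself, `tf.debugging.check_numerics` serves a similar purpose.

```python
import math

def first_non_finite(grads):
    """Return (variable_index, value) for the first NaN/inf entry found in a
    list of nested-list gradients, or None if every entry is finite."""
    def walk(x):
        # Flatten arbitrarily nested lists/tuples into a stream of scalars.
        if isinstance(x, (list, tuple)):
            for item in x:
                yield from walk(item)
        else:
            yield x

    for index, grad in enumerate(grads):
        for value in walk(grad):
            if not math.isfinite(value):
                return index, value
    return None

# Hypothetical gradients for two variables, already converted to nested lists.
good = [[[-0.0012, 0.0012], [-0.0007, 0.0007]], [0.01, -0.02]]
bad = [[[-0.0012, 0.0012], [-0.0007, 0.0007]], [0.01, float("nan")]]

print(first_non_finite(good))  # None
print(first_non_finite(bad))   # (1, nan)
```

Running such a check on the unclipped `grads` would show whether the NaN is already present before `tf.clip_by_norm` is called.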

Finally, here are the gradients from the previous iteration, right before they become NaN:

Gradient Hidden Layer 1 : [
       [-0.00839788,  0.00738428,  0.0006091 ,  0.00240378],
       [-0.00171666,  0.00157034,  0.00012367,  0.00051114],
       [-0.0069742 ,  0.00618575,  0.00050313,  0.00201353],
       [-0.00263796,  0.00235524,  0.00018991,  0.00076653],
       [-0.01119559,  0.01178695,  0.0007518 ,  0.00383774],
       [-0.08692611,  0.07620181,  0.00630627,  0.02480747],
       [-0.10398869,  0.09012008,  0.00754619,  0.02933704],
       [-0.04725896,  0.04004722,  0.00343443,  0.01303552],
       [-0.00493888,  0.0043246 ,  0.00035772,  0.00140733],
       [-0.00559061,  0.00484629,  0.00040546,  0.00157689],
       [-0.00595227,  0.00524359,  0.00042967,  0.00170693],
       [-0.02488269,  0.02446024,  0.00177054,  0.00796351],
       [-0.00850916,  0.00703857,  0.00062265,  0.00229139],
       [-0.00220688,  0.00196331,  0.0001586 ,  0.0006386 ]]

Gradient Hidden Layer 2 : [
       [-2.6317715e-04, -2.1482834e-04,  3.0761934e-04,  3.1322116e-04],
       [ 8.4564053e-03,  6.7548533e-03, -9.8721031e-03, -1.0047102e-02],
       [-3.8322039e-05, -3.1298561e-05,  4.3669730e-05,  4.4472294e-05],
       [ 3.6933038e-03,  2.9515910e-03, -4.3102605e-03, -4.3875999e-03]]


Gradient Output Layer : [
       [-0.0011955 ,  0.0011955 ],
       [-0.00074397,  0.00074397],
       [-0.0001833 ,  0.0001833 ],
       [-0.00018749,  0.00018749]]

I'm not very familiar with TensorFlow, so maybe I'm not training the model correctly? However, the model seemed to train correctly before the gradients went wrong.

I know I can use many other methods to counter exploding gradients (batch norm, dropout, a lower learning rate, etc.), but I want to understand why gradient clipping is not working here. I thought that, by definition, a gradient can't explode when it is clipped.

Thank you
