tensorflow gradient - getting all nan values

Question

I am using Python 3 with Anaconda, and TensorFlow 1.12 with eager execution.

I am using it to create a triplet loss function for a siamese network, and need to calculate the distance between different data samples.

I wrote a function for the distance calculation, but no matter what I do, when I try to calculate its gradient with respect to the network's output, it keeps giving me an all-nan gradient.

This is the code:

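import tensorflow as tf

def matrix_row_wise_norm(matrix):
    # matrix: (n, d) -> tensor: (n, d, 1)
    tensor = tf.expand_dims(matrix, -1)

    # (n, 1, d) - (1, n, d) broadcasts to (n, n, d): all pairwise row differences
    tensor = tf.transpose(tensor, [0, 2, 1]) - tf.transpose(tensor, [2, 0, 1])
    # Euclidean norm over the feature axis -> (n, n) distance matrix
    norm = tf.norm(tensor, axis=2)
    return norm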

In the loss function I use:

def loss(y_true, y_pred):
    with tf.GradientTape() as t:
        t.watch(y_pred)
        distance_matrix = matrix_row_wise_norm(y_pred)
    grad = t.gradient(distance_matrix, y_pred)

And the grad is all nans. I checked that y_pred is made of legit values, and it is. I tried to compute the gradient of y_pred * 2 with respect to y_pred and got legitimate gradient values.

What am I missing here? Is the indexing in the creation of the distance matrix problematic?

edit:

the dtype of both y_pred and loss is tf.float32

edit: found an open bug report in tf - could this be the issue?

edit:

When I change the norm axis to 0 or 1, I get legitimate values and nothing goes to nan. The operation I get from norm with axis=2 is the pairwise distance between the rows of the matrix. I suspected this might have something to do with the 0 distance between a row and itself, so I clipped the values with a minimum of 1e-7, without any luck.

Thanks

I had the same problem. Please check the `dtype` of `y_pred` and `loss`. — Ankish Bansal, Jan 24 '19 at 12:16
What are each of the axes of your matrix? My only guess is that `norm(tensor, axis=2)` or the transpose and subtract operation above it does not have a gradient. I've run into that issue before with custom loss functions and, I think, reshaping? Non-differentiable operations seem to kill the gradient computation. — Engineero, Jan 24 '19 at 14:52
@Engineero - what I do here is take a matrix in which each row is a vector and compute the pairwise distance between all the vectors. I get this by duplicating the vectors, transposing, subtracting and taking the norm. How could this not have a gradient? — thebeancounter, Jan 24 '19 at 14:54

score 4 · Accepted Answer · edited Oct 15 '19 at 23:46

It seems that tf.norm suffers from numerical instability, as explained here
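To illustrate the kind of instability meant here: the derivative of sqrt(s) is 1 / (2 * sqrt(s)), which is infinite at s = 0, while the derivative of the squared difference is 0 there, and reverse-mode differentiation multiplies the two. The sketch below mimics that product in plain Python for a single number. `norm_grad_1d` is a hypothetical helper made up for illustration; it is not TensorFlow's actual gradient code.

```python
import math

def norm_grad_1d(x):
    """Gradient of sqrt(x ** 2), assembled as a product of local derivatives."""
    s = x * x  # inner value: the squared distance
    # Local derivative of sqrt(s); it is infinite at s == 0.
    d_sqrt = math.inf if s == 0.0 else 0.5 / math.sqrt(s)
    # Local derivative of x ** 2; it is zero at x == 0.
    d_square = 2.0 * x
    # Chain rule: multiply the local derivatives.
    return d_sqrt * d_square

print(norm_grad_1d(2.0))  # ordinary point: 0.25 * 4.0
print(norm_grad_1d(0.0))  # zero distance: inf * 0.0
```

At zero distance the product is inf * 0.0, which is nan in floating point. This would also be consistent with clipping the distances afterwards not helping, since the infinite factor comes from the square root itself, before any clip is applied.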

The linked discussion also suggests using the L2 norm, which is more numerically stable, so I tried that. I was still getting nan values, this time because of 0 gradients. So I used the L2 norm together with gradient clipping. So far so good: the loss function works and manages to converge.

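import numpy as np
import tensorflow as tf

def last_attempt(y_true, y_pred):
    # Triplet loss using the hardest positive and hardest negative per sample.
    # Needs eager execution because of the .numpy() calls.
    loss = tf.zeros(1)
    y = y_true.numpy().squeeze()  # labels as a NumPy array

    for i in range(y_pred.shape[0]):
        dist = tf.gather(y_pred, [i], axis=0)
        # tf.nn.l2_loss returns sum(x ** 2) / 2 per row: no square root involved
        norm = tf.map_fn(tf.nn.l2_loss, dist-y_pred)

        # Hardest positive: largest distance among samples with the same label
        d = norm.numpy()
        d[np.where(y != y[i])] = 0.0
        max_pos = tf.gather(norm, np.argmax(d))

        # Hardest negative: smallest distance among samples with another label
        d = norm.numpy()
        d[np.where(y == y[i])] = np.inf
        min_neg = tf.gather(norm, np.argmin(d))

        # Margin of 1, with the per-sample term clipped to [1e-8, 1e1]
        loss += tf.clip_by_value(max_pos - min_neg + tf.constant(1, dtype=tf.float32),
                                 1e-8, 1e1)

    return loss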

There is much room for optimizing that function; see my other SO question, where I am working on that.

Can we just add an epsilon inside the norm to make a safe norm like that in [this answer](https://stackoverflow.com/a/44960540/3552975)? — Lerner Zhang, Aug 22 '20 at 15:19
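A minimal sketch of the epsilon idea from the comment above, assuming TensorFlow 2.x with eager execution (the question used 1.12). `safe_pairwise_distances` and the default `eps` value are hypothetical choices made for illustration. The epsilon is added inside the square root, so the square root never sees an exact 0 on the diagonal.

```python
import tensorflow as tf

def safe_pairwise_distances(matrix, eps=1e-12):
    """Pairwise Euclidean distances between the rows of `matrix`.

    `eps` (an assumed value) is added to the squared distances before the
    square root, which keeps the derivative of the square root finite.
    """
    # diff[i, j, :] = matrix[i, :] - matrix[j, :], shape (n, n, d)
    diff = tf.expand_dims(matrix, 1) - tf.expand_dims(matrix, 0)
    squared = tf.reduce_sum(tf.square(diff), axis=-1)
    return tf.sqrt(squared + eps)

# Two rows; the diagonal of the distance matrix is (almost) zero.
y_pred = tf.constant([[0.0, 0.0], [3.0, 4.0]])

with tf.GradientTape() as tape:
    tape.watch(y_pred)
    distance_matrix = safe_pairwise_distances(y_pred)

grad = tape.gradient(distance_matrix, y_pred)
print(grad)
```

Adding the epsilon after the square root, or clipping the result, would not be expected to have the same effect, because the infinite derivative comes from the square root evaluated at exactly 0.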
