
I have an exploding gradient problem which I couldn't solve despite trying for several days. I implemented a custom message passing graph neural network in TensorFlow, which is used to predict a continuous value from graph data. Each graph is associated with one target value. Each node of a graph is represented by a node attribute vector, and the edges between nodes are represented by an edge attribute vector.

Within a message passing layer, node attributes are updated in a certain way (e.g., by aggregating other node/edge attributes), and these updated node attributes are returned.
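For context, here is roughly what this update step looks like (a minimal sketch for illustration; `MessagePassingLayer` and its signature are hypothetical, and only `Net` and the concatenation correspond to my actual code, shown below):

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import backend as K

class MessagePassingLayer(layers.Layer):
    """Illustrative sketch of the update step -- not the actual implementation."""
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        # Single-layer feed-forward network applied to the concatenated message
        self.Net = layers.Dense(units)

    def call(self, inputs):
        v_i, v_j, e = inputs                    # node attributes and edge attribute
        neighbors_mean = (v_i + v_j) / 2.0      # element-wise mean of the two nodes
        z = K.concatenate([neighbors_mean, e], axis=-1)
        return self.Net(z)                      # updated node attributes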

Now, I managed to figure out where the gradient problem occurs in my code. It is in the snippet below.

to_concat = [neighbors_mean, e]
z = K.concatenate(to_concat, axis=-1)
output = self.Net(z)

Here, neighbors_mean is the element-wise mean of the two node attribute vectors vi and vj that form the edge with edge attribute e. Net is a single-layer feed-forward network. With this, the training loss suddenly jumps to NaN after about 30 epochs with a batch size of 32. With a batch size of 128, the gradients still explode, after about 200 epochs.

I found that, in this case, the gradients explode because of the edge attribute e. If I don't concatenate neighbors_mean with e and just use the code below, there is no gradient explosion.

output = self.Net(neighbors_mean)

I can also avoid the gradient explosion by sending e through a sigmoid function, as follows. But this degrades the performance (final MAE), because the values in e are mapped to the 0-1 range non-linearly. Note that a Rectified Linear Unit (ReLU) instead of sigmoid didn't work.

to_concat = [neighbors_mean, tf.math.sigmoid(e)]
z = K.concatenate(to_concat, axis=-1)
output = self.Net(z)

Just to mention that e carries a single value relating to the distance between the two corresponding nodes, and this distance is always in the range 0.5-4. There are no large values or NaNs in e.
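For comparison, a plain linear rescaling (a sketch only, not in my current code) would map e to the same 0-1 range as sigmoid does, but without distorting the distances:

# Known bounds of the distance feature: 0.5 <= e <= 4 (see above)
E_MIN, E_MAX = 0.5, 4.0

def rescale_edge_attr(e):
    # Map e linearly from [0.5, 4] to [0, 1]; unlike sigmoid, this
    # preserves relative distances exactly
    return (e - E_MIN) / (E_MAX - E_MIN)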

I have a custom loss function to train this model, but I found that the loss is not the problem (other losses led to the same problem). Below is my custom loss function. Note that although this is a single-output regression network, the final layer of my NN has two neurons, corresponding to the mean and log(sigma) of the prediction.

import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

def robust_loss(y_true, y_pred):
  """
  Computes the robust loss between labels and predictions.
  """
  mean, sigma = tf.split(y_pred, 2, axis=-1)
  # Tried limiting 'sigma' with sigma = tf.clip_by_value(sigma, -4, 1.0),
  # but the gradients still explode
  loss = np.sqrt(2.0) * K.abs(mean - y_true) * K.exp(-sigma) + sigma
  return K.mean(loss)
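(Up to an additive constant, this is the negative log-likelihood of a Laplace distribution whose scale is exp(sigma)/sqrt(2). A quick sanity check on dummy tensors, continuing from the definition above, shows the loss itself is finite for sane inputs:)

# Dummy batch: each row of y_pred holds (mean, log_sigma)
y_pred = tf.constant([[0.1, 0.0], [0.5, -1.0]])
y_true = tf.constant([[0.0], [1.0]])
print(robust_loss(y_true, y_pred).numpy())  # a finite scalar, no NaN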

I basically tried everything suggested online to avoid gradient explosion.

  1. Applied gradient clipping, both with Adam(lr, clipnorm=1, clipvalue=5) and with tf.clip_by_global_norm(gradients, 1.0) (see the sketch after this list)
  2. My target variables are always scaled
  3. Weights are initialized with the glorot_uniform distribution
  4. Applied regularisation to the weights
  5. Tried larger batch sizes (up to 256; this only delays the gradient explosion)
  6. Tried a reduced learning rate
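The tf.clip_by_global_norm variant from point 1 was applied inside a custom training step along these lines (a sketch; model, optimizer, and the batching code are assumed):

import tensorflow as tf

@tf.function
def train_step(x, y_true):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)                    # model assumed to be defined
        loss = robust_loss(y_true, y_pred)
    gradients = tape.gradient(loss, model.trainable_variables)
    gradients, _ = tf.clip_by_global_norm(gradients, 1.0)   # cap the global norm
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss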

What am I missing here? I definitely know it has something to do with concatenating e. But given that 0.5 < e < 4, why do the gradients explode in this case? This feature e is important to me. What else can I do to avoid numerical overflow in my model?

  • Is the graph you mentioned a Directed Acyclic Graph? – Bob Oct 07 '21 at 07:49
  • It is undirected, and there could be cycles in the graph. – Achintha Ihalage Oct 07 '21 at 09:50
  • Maybe you should introduce some regularization of the node values? It seems what you are trying to solve is analogous to a factor graph, and the message passing algorithm may have difficulty converging when the factor graph has cycles. I am not sure if this is related to your problem. Could you please write a more detailed description so that we can analyze it? – Bob Oct 07 '21 at 09:56
  • I think my graphs are not necessarily factor graphs. All my graphs represent crystal structures (so-called [crystal graphs](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.120.145301#fulltext)). If the distance between 2 atoms is less than a threshold, I consider them "bonded". With this definition I may have cyclic graphs. Node (atom) attributes are properly normalised. But I believe the problem is in the edge attribute `e`, because the gradients start to explode only when I concatenate `e`. Please let me know what further details you need; I can update the question. – Achintha Ihalage Oct 07 '21 at 10:14

2 Answers


It looks like you have already tried most of the standard solutions to the exploding gradient problem. Below is a list of all the solutions you can try.

Solutions to avoid the exploding gradient problem

  1. Appropriate weight initialization: use a weight initialization scheme that matches the activation function (see the table below).

    | Initialization | Activation function |
    | --- | --- |
    | He | ReLU & variants |
    | LeCun | SELU |
    | Glorot | Softmax, Logistic, None, Tanh |
  2. Redesigning your neural network: use fewer layers and/or a smaller batch size

  3. Choosing a non-saturating activation function: pick the right activation function and pair it with a reduced learning rate:

    • ReLU
    • Leaky ReLU
    • randomized leaky ReLU (RReLU)
    • parametric leaky ReLU (PReLU)
    • exponential linear unit (ELU)
  4. Batch normalisation: ideally, use batch normalisation before or after each layer, based on what works best for your dataset.

    • after each layer (paper reference)

      model = keras.models.Sequential([
          keras.layers.Flatten(input_shape=[28, 28]),
          keras.layers.BatchNormalization(),
          keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
          keras.layers.BatchNormalization(),
          keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
          keras.layers.BatchNormalization(),
          keras.layers.Dense(10, activation="softmax")
      ])
      
    • before each layer

      model = keras.models.Sequential([
          keras.layers.Flatten(input_shape=[28, 28]),
          keras.layers.BatchNormalization(),
          keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
          keras.layers.BatchNormalization(),
          keras.layers.Activation("elu"),
          keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
          keras.layers.BatchNormalization(),
          keras.layers.Activation("elu"),
          keras.layers.Dense(10, activation="softmax")
      ])
      
  5. Gradient clipping: good default values are clipnorm=1.0 and clipvalue=0.5 (see the sketch after this list)

  6. Ensure the right optimizer is used: since you have used the Adam optimizer, check whether another optimizer works better for your case. Refer to the documentation for information on the available optimizers [SGD, RMSprop, Adam, Adadelta, Adagrad, Adamax, Nadam, Ftrl]

  7. Truncated backpropagation through time: often works for RNNs; refer to the documentation

  8. Use LSTMs (a solution for RNNs)

  9. Use weight regularizers on layers: set kernel_regularizer to L1 or L2. See the weight regularizer documentation
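For point 5, gradient clipping in Keras is set directly on the optimizer; a minimal sketch (the surrounding model code is assumed):

from tensorflow import keras

# Clip the gradient norm at 1.0; alternatively pass clipvalue=0.5 to clip
# each gradient element individually
optimizer = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
model.compile(loss="mse", optimizer=optimizer)  # `model` assumed to be defined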

For more information, refer to Chapter 11 of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron.

  • Thanks for taking the time to answer. I had tried almost all of these things for several days, but none of them worked. I suspect the problem might be due to the different scales of the two tensors that I am concatenating, `neighbors_mean` and `e`. What I don't understand is why the model trains as usual for several epochs and then the loss suddenly jumps to NaN. – Achintha Ihalage Oct 09 '21 at 22:16
  • You seem to be facing the issue only when using ReLU, while sigmoid performs well. Do try my suggestion in point 3 about choosing the right non-saturating activation function: try a different variant of ReLU, such as Leaky ReLU, randomized leaky ReLU (RReLU), parametric leaky ReLU (PReLU), or exponential linear unit (ELU). – Archana David Oct 11 '21 at 01:25

I solved the problem thanks to this cool debugging tool, tf.debugging.check_numerics.

I initially identified that concatenating e was the problem, and then realised that the values in e are considerably larger than the values in neighbors_mean, which it is concatenated with. Once they are concatenated and sent through a neural network (Net() in my code), I observed outputs on the order of hundreds, slowly reaching thousands as training progressed.

This is problematic because I have a softmax operation within the message passing layer. Note that softmax computes an exponential: softmax(x_i) = e^(x_i) / Σ_j e^(x_j). Anything above e^709 results in a numerical overflow in Python. This was producing inf values, and eventually everything becoming nan was the problem in my code. So this is technically not an exploding gradient problem, which is why it couldn't be solved with gradient clipping.

How did I track the issue?

I put tf.debugging.check_numerics() calls under several layers/tensors I suspected were producing nan values, something like this:

tf.debugging.check_numerics(layerN, "LayerN is producing nans!")

This produces an InvalidArgumentError as soon as the layer outputs become inf or nan during training.

Traceback (most recent call last):
  File "trainer.py", line 506, in <module>
    worker.train_model()
  File "trainer.py", line 211, in train_model
    l, tmae = train_step(*batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 855, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  LayerN is producing nans! : Tensor had NaN values

Now we know where the problem is.
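One practical detail: check_numerics is a pass-through op, so the tensor it returns must be the one used downstream, otherwise the check can be pruned from the graph. A small helper sketch (the layer name is illustrative); alternatively, tf.debugging.enable_check_numerics() instruments every op automatically:

import tensorflow as tf

def checked(tensor, name):
    # Pass-through numeric check; use the returned tensor downstream
    return tf.debugging.check_numerics(tensor, f"{name} is producing nans!")

# Or instrument the whole program (slower, but exhaustive):
# tf.debugging.enable_check_numerics()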

How to solve the issue

I applied kernel constraints to the weights of the neural network whose output gets passed to the softmax function:

from tensorflow.keras import layers, regularizers
from tensorflow.keras.constraints import min_max_norm

layers.Dense(x, name="layer1",
             kernel_regularizer=regularizers.l2(1e-6),
             kernel_constraint=min_max_norm(min_value=1e-30, max_value=1.0))

This should make sure that all weights are less than 1 and the layer does not produce large outputs. This resolved the problem without degrading the performance.

Alternatively, one could use a numerically stable implementation of the softmax function, as sketched below.
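For reference, the standard trick is to subtract the per-row maximum before exponentiating; this leaves the softmax output unchanged but keeps every exponent non-positive, so exp() cannot overflow (a sketch; tf.nn.softmax already does this internally):

import tensorflow as tf

def stable_softmax(logits, axis=-1):
    # Shifting by the max does not change the result, since the factor
    # exp(-max) cancels between numerator and denominator
    shifted = logits - tf.reduce_max(logits, axis=axis, keepdims=True)
    exp = tf.exp(shifted)
    return exp / tf.reduce_sum(exp, axis=axis, keepdims=True)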
