
I have around 100k "data batches" of sequential data on which I am running a fairly complex recurrent model (120k parameters). After some point (which seems rather random), the loss turns to NaN. I have tried the following:

  1. Checked the data for non-numeric values, which turned out to be fine
  2. Clipped the gradients to norm 1 (see the sketch below)
  3. Constrained every layer's parameters
  4. Lowered the learning rate and increased the epsilon in RMSProp; however, I am still getting NaN after a certain point
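
In case it is useful, steps 1, 2, and 4 look roughly like this in my training loop (a minimal sketch for illustration, written in PyTorch with a toy model standing in for my real one):

    import torch
    import torch.nn as nn

    # Toy stand-in for the real ~120k-parameter recurrent model
    model = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
    head = nn.Linear(64, 1)
    params = list(model.parameters()) + list(head.parameters())

    # Step 4: lowered learning rate and increased epsilon in RMSProp
    optimizer = torch.optim.RMSprop(params, lr=1e-4, eps=1e-6)
    loss_fn = nn.MSELoss()

    def train_step(x, y):
        # Step 1: check the batch for non-numeric values (NaN / inf)
        assert torch.isfinite(x).all() and torch.isfinite(y).all()

        optimizer.zero_grad()
        out, _ = model(x)        # out: (batch, seq_len, hidden)
        pred = head(out[:, -1])  # predict from the last time step
        loss = loss_fn(pred, y)
        loss.backward()

        # Step 2: clip the global gradient norm to 1
        torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
        optimizer.step()
        return loss.item()

    # e.g. one step on random data: 8 sequences of length 20 with 32 features
    print(train_step(torch.randn(8, 20, 32), torch.randn(8, 1)))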

Is there anything else I can try to debug this?

Martin Thoma
GTOgod

2 Answers


Without code, I can only give a very general answer:

NaN can occur when you:

  • Divide by 0
  • Take the logarithm of 0 or a negative number (e.g. a value that has underflowed to 0)
  • Take the square root of a negative number

Look at your optimization metric (the loss) to see which of these might happen in your case. Look for places where the numbers involved can get very large or very small in absolute value. Often, adding a small constant solves the problem.
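
For example (a NumPy sketch of the idea; the variable names and the value of eps are only illustrative):

    import numpy as np

    eps = 1e-8  # small constant; pick it relative to your value range

    p = np.array([0.0, 0.0, 0.3])     # e.g. predicted probabilities that hit exactly 0
    v = np.array([0.0, -1e-12, 2.0])  # e.g. a variance that goes slightly negative

    log_p = np.log(p + eps)              # np.log(p) gives -inf at p == 0
    ratio = v / (p + eps)                # v / p gives nan (0/0) and -inf here
    std   = np.sqrt(np.maximum(v, 0.0))  # np.sqrt(v) gives nan for v < 0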

There are many other cases, which are likely not relevant to you:

  • arcsin outside of [-1, 1]
  • float('inf') / float('inf')
  • 0 * float('inf')
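
These are easy to reproduce interactively, for example with plain Python floats and NumPy (math.asin would raise a ValueError rather than return NaN):

    import numpy as np

    print(np.arcsin(1.5))               # nan (plus a RuntimeWarning): input outside [-1, 1]
    print(float('inf') / float('inf'))  # nan
    print(0.0 * float('inf'))           # nan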

See also: My guide for debugging neural networks

Martin Thoma

I don't use recurrent networks; however, I have encountered this sporadic NaN problem in my own work with CNNs when training batch sizes are small. Try enlarging your batch size.

John Ladasky
  • Can you explain how the NaN occurred with small batch sizes and how large batch sizes prevent that? – Zaccharie Ramzi May 19 '19 at 09:35
  • I suspect that exploding gradients are to blame, and even though I don't use RNNs, I know they are vulnerable to oscillations. In my case, I was working with an unusual activation function, and I'm not sure that there was an upper bound on its slope. If an unlucky batch was generated, I believe it was possible for the gradient descent algorithm to output a step so large that the error function would overflow. My models would train dependably provided that batch sizes were at least 16. I got the occasional NaN with a batch size of 8, and quite a few with batches of 4. – John Ladasky May 22 '19 at 06:10
  • But have you checked all those things? What was your architecture (loss function included)? I am very surprised that batch size would have anything to do with, or be a solution to, exploding gradients. Usually you would use batch normalization or gradient clipping to avoid that problem. – Zaccharie Ramzi May 22 '19 at 07:42