2021-03-09
I trained my transformer models in PyTorch. For the first few batches, the loss calculation and gradient updates worked fine, but after several iterations the model started outputting NaN values. I am confident there is no flawed data in the dataset. Also, this is not a classification problem; the labels are floating-point numbers.
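For context, this is roughly how I would catch the first batch where the NaNs show up. The toy linear model, random data, and optimizer below are only placeholders, not my actual setup; the point is the isnan check inside the loop:

```python
import torch
import torch.nn as nn

# Placeholder regression setup (the real model was a transformer with float targets).
model = nn.Linear(16, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(100):
    inputs = torch.randn(32, 16)
    targets = torch.randn(32, 1)

    outputs = model(inputs)
    loss = criterion(outputs, targets)

    # Stop as soon as NaNs appear, so you know exactly which step went wrong.
    if torch.isnan(outputs).any() or torch.isnan(loss):
        print(f"NaN first appeared at step {step}")
        break

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```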
2021-03-10
Follow-up: What an interesting twist! When I ran the transformer with a larger architecture (e.g. 6 encoder layers, 8 heads, etc.), the NaN values disappeared. It seems the gradient explosion only existed in tiny models.
Solutions:
I searched the PyTorch forum and Stack Overflow and found the actual reason for these NaNs. First, since the NaN loss did not appear at the very beginning, the model itself is probably well defined; the cause is more likely the data or the training process. I ran torch.autograd.set_detect_anomaly(True) as suggested in https://discuss.pytorch.org/t/gradient-value-is-nan/91663/2, and it raised: RuntimeError: Function 'StdBackward1' returned nan values in its 0th output.
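For reference, anomaly detection is just a single switch before training; the small stand-in model below only shows where the call goes:

```python
import torch
import torch.nn as nn

# Make autograd raise a RuntimeError (with a traceback pointing at the offending op)
# as soon as any backward function produces NaN. It slows training noticeably,
# so turn it on only while debugging.
torch.autograd.set_detect_anomaly(True)

model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16), nn.Linear(16, 1))
criterion = nn.MSELoss()

inputs = torch.randn(8, 16)
targets = torch.randn(8, 1)

loss = criterion(model(inputs), targets)
loss.backward()  # would raise here if some backward op returned NaN
```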
Following the similar question at https://discuss.pytorch.org/t/gradient-of-standard-deviation-is-nan/14713, I double-checked the output of each layer inside the transformer. Strangely, after dozens of iterations, the positional embedding layer output a vector of all zeros. As a result, the LayerNorm that normalizes it could not backpropagate the loss properly: it computes the standard deviation, and the standard deviation has no well-defined gradient when it is zero (the backward pass divides by it). A possible fix, if you are using PyTorch, is to use x.std(unbiased=False).
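The effect is easy to reproduce in isolation. A minimal sketch, assuming an all-zero input like the one the positional embedding produced:

```python
import torch

# Reproduce the StdBackward NaN: an all-zero vector, like the output
# the positional embedding layer ended up producing.
x = torch.zeros(8, requires_grad=True)

s = x.std()      # std of an all-zero vector is 0
s.backward()

print(s.item())  # 0.0
print(x.grad)    # all NaN: d(std)/dx = (x - mean) / ((n - 1) * std) = 0 / 0
```

Whether unbiased=False alone is enough depends on your LayerNorm implementation; custom ones from transformer tutorials often call x.std() directly, while adding a small eps inside the denominator (as nn.LayerNorm does with its variance) is another common way to avoid dividing by a zero standard deviation.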
That's my encounter with NaN loss and MSE. I hope my experience gives you some insight if you run into the same situation!
Related Questions: Deep-Learning Nan loss reasons