2021-03-09
I trained my transformer models in PyTorch. For the first few batches, the loss calculation and gradient updates worked fine, but after several iterations the model started outputting NaN values. I am confident there is no flawed data in the dataset. Also, this is not a classification problem; the labels are floating-point numbers.
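For context, this is roughly how I would catch the first batch where the NaNs show up. The toy linear model, random data, and optimizer below are only placeholders, not my actual setup; the point is the isnan check inside the loop:

```python
import torch
import torch.nn as nn

# Placeholder regression setup (the real model was a transformer with float targets).
model = nn.Linear(16, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(100):
    inputs = torch.randn(32, 16)
    targets = torch.randn(32, 1)

    outputs = model(inputs)
    loss = criterion(outputs, targets)

    # Stop as soon as NaNs appear, so you know exactly which step went wrong.
    if torch.isnan(outputs).any() or torch.isnan(loss):
        print(f"NaN first appeared at step {step}")
        break

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```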
2021-03-10
Follow-up: What an interesting twist! When I ran the transformer with a larger architecture (e.g. 6 encoder layers, 8 heads, etc.), the NaN values disappeared. It seems the gradient explosion only existed in tiny models.
Solutions:
I searched the PyTorch forum and Stack Overflow and found the actual reason for these NaNs. First, since the NaN loss did not appear at the very beginning, the model itself is probably well defined; the cause is more likely the data or the training process. I ran torch.autograd.set_detect_anomaly(True) as suggested in https://discuss.pytorch.org/t/gradient-value-is-nan/91663/2, and it raised: RuntimeError: Function 'StdBackward1' returned nan values in its 0th output.
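For reference, anomaly detection is just a single switch before training; the small stand-in model below only shows where the call goes:

```python
import torch
import torch.nn as nn

# Make autograd raise a RuntimeError (with a traceback pointing at the offending op)
# as soon as any backward function produces NaN. It slows training noticeably,
# so turn it on only while debugging.
torch.autograd.set_detect_anomaly(True)

model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16), nn.Linear(16, 1))
criterion = nn.MSELoss()

inputs = torch.randn(8, 16)
targets = torch.randn(8, 1)

loss = criterion(model(inputs), targets)
loss.backward()  # would raise here if some backward op returned NaN
```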
Following the similar question at https://discuss.pytorch.org/t/gradient-of-standard-deviation-is-nan/14713, I double-checked the output of each layer inside the transformer. Strangely, after dozens of iterations, the positional embedding layer output a vector of all zeros. As a result, the LayerNorm that normalizes it could not backpropagate the loss properly: it computes the standard deviation, and the standard deviation has no well-defined gradient when it is zero (the backward pass divides by it). A possible fix, if you are using PyTorch, is to use x.std(unbiased=False).
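The effect is easy to reproduce in isolation. A minimal sketch, assuming an all-zero input like the one the positional embedding produced:

```python
import torch

# Reproduce the StdBackward NaN: an all-zero vector, like the output
# the positional embedding layer ended up producing.
x = torch.zeros(8, requires_grad=True)

s = x.std()      # std of an all-zero vector is 0
s.backward()

print(s.item())  # 0.0
print(x.grad)    # all NaN: d(std)/dx = (x - mean) / ((n - 1) * std) = 0 / 0
```

Whether unbiased=False alone is enough depends on your LayerNorm implementation; custom ones from transformer tutorials often call x.std() directly, while adding a small eps inside the denominator (as nn.LayerNorm does with its variance) is another common way to avoid dividing by a zero standard deviation.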
That's my encounter with NaN loss and MSE. I hope my experience gives you some insight if you run into the same situation!
Related Questions: Deep-Learning Nan loss reasons