Loss becomes nan after a certain step while the model is training

Question

I'm working on TensorFlow object detection in Google Colab. While the model is training, the reported loss becomes nan after a certain step. How can I fix this?

Model: ssd_efficientdet_d2

Output:

I1125 09:30:20.814607 139701278168960 model_lib_v2.py:652] Step 1400 per-step time 0.418s loss=1.650
INFO:tensorflow:Step 1500 per-step time 0.601s loss=1.285

I1125 09:31:09.918310 139701278168960 model_lib_v2.py:652] Step 1500 per-step time 0.601s loss=1.285
INFO:tensorflow:Step 1600 per-step time 0.444s loss=1.344

I1125 09:31:59.594189 139701278168960 model_lib_v2.py:652] Step 1600 per-step time 0.444s loss=1.344
INFO:tensorflow:Step 1700 per-step time 0.511s loss=nan

I1125 09:32:49.015780 139701278168960 model_lib_v2.py:652] Step 1700 per-step time 0.511s loss=nan
INFO:tensorflow:Step 1800 per-step time 0.576s loss=nan

I1125 09:33:39.257319 139701278168960 model_lib_v2.py:652] Step 1800 per-step time 0.576s loss=nan
INFO:tensorflow:Step 1900 per-step time 0.439s loss=nan

I1125 09:34:27.547188 139701278168960 model_lib_v2.py:652] Step 1900 per-step time 0.439s loss=nan
INFO:tensorflow:Step 2000 per-step time 0.445s loss=nan

I1125 09:35:17.008013 139701278168960 model_lib_v2.py:652] Step 2000 per-step time 0.445s loss=nan
INFO:tensorflow:Step 2100 per-step time 0.490s loss=nan

I1125 09:36:08.541600 139701278168960 model_lib_v2.py:652] Step 2100 per-step time 0.490s loss=nan
INFO:tensorflow:Step 2200 per-step time 0.697s loss=nan

Your gradients are probably exploding (your loss is going up). Reducing your learning rate could be a good first thing to try. - Lescurel, Nov 25 '20 at 10:02
@cbk I am experiencing the same problem. How did you solve it? - Matthias, Mar 02 '21 at 09:50
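
To illustrate the exploding-gradient explanation from the first comment, here is a toy sketch in plain NumPy. It is not the Object Detection API training loop: the function `train`, the loss `w**2`, and all the numbers are assumptions chosen only to show how a step size that is too large can drive a finite loss to inf and then to nan, and how a smaller step or gradient clipping keeps it finite.

```python
import numpy as np


def train(lr, steps=200, clip_norm=None):
    """Run gradient descent on the toy loss(w) = w**2 in float32.

    Returns the loss after every step. All names and values here are
    illustrative assumptions, not part of any TensorFlow API.
    """
    w = np.float32(1.0)
    losses = []
    # Overflow (inf) and inf - inf (nan) are expected in the bad case,
    # so silence NumPy's floating-point warnings for this demo.
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(steps):
            grad = np.float32(2.0) * w  # d/dw of w**2
            if clip_norm is not None:
                # Clip by norm: rescale the gradient if it is too large.
                norm = np.abs(grad)
                if norm > clip_norm:
                    grad = grad * np.float32(clip_norm) / norm
            w = w - np.float32(lr) * grad
            losses.append(float(w * w))
    return losses


too_high = train(lr=1.5)                   # each step doubles |w|
lower = train(lr=0.1)                      # each step shrinks |w|
clipped = train(lr=1.5, clip_norm=1.0)     # same step size, clipped gradient

print("lr=1.5 final loss:", too_high[-1])
print("lr=0.1 final loss:", lower[-1])
print("lr=1.5 with clipping, max loss:", max(clipped))
```

With the large step size the weight eventually overflows float32, and the next update computes inf minus inf, which is where the nan comes from. Clipping keeps the toy loss finite here but does not make it converge, so lowering the learning rate remains the first thing to try.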

score 2 · Answer 1 · answered Nov 25 '20 at 13:07

I have seen many things make a model diverge, which shows up as an increasing loss or a decreasing accuracy.

The learning rate may be too high, so first and foremost, decrease the learning rate.
If you are using a classifier such as DNNClassifier, check that it is the correct one.
Check that the labels are correct and lie in the domain of the loss function.
Check the loss function as well. Sometimes the cause is that the input data does not match what the loss function expects.
Make sure the data is properly normalized. You probably want to have the pixels in the range [-1, 1] and not [0, 255].
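
The label and normalization checks above could be sketched roughly as follows, using only NumPy. The helper names `check_example` and `normalize_pixels` are hypothetical, and the sketch assumes boxes are stored as `[ymin, xmin, ymax, xmax]` normalized to [0, 1] and that class ids run from 1 to `num_classes`; adapt both assumptions to however your dataset is actually encoded. Whether you need to rescale pixels yourself also depends on whether your model's own preprocessing already does it.

```python
import numpy as np


def check_example(image, boxes, classes, num_classes):
    """Return a list of problems found in one training example.

    Assumes (hypothetically) boxes of shape (N, 4) in
    [ymin, xmin, ymax, xmax] order, normalized to [0, 1],
    and integer class ids in [1, num_classes].
    """
    problems = []
    if not np.all(np.isfinite(image)):
        problems.append("image contains NaN or Inf")
    if not np.all(np.isfinite(boxes)):
        problems.append("boxes contain NaN or Inf")
    if np.any(boxes < 0.0) or np.any(boxes > 1.0):
        problems.append("box coordinates outside [0, 1]")
    # A box whose max edge is not beyond its min edge has no area.
    if np.any(boxes[:, 2] <= boxes[:, 0]) or np.any(boxes[:, 3] <= boxes[:, 1]):
        problems.append("box with zero or negative height or width")
    if np.any(classes < 1) or np.any(classes > num_classes):
        problems.append("class id outside [1, num_classes]")
    return problems


def normalize_pixels(image_uint8):
    """Map uint8 pixels from [0, 255] to float32 values in [-1, 1]."""
    return image_uint8.astype(np.float32) / 127.5 - 1.0


# Usage on one made-up example with a deliberately bad box and label.
image = np.zeros((4, 4, 3), dtype=np.uint8)
bad_boxes = np.array([[0.5, 0.1, 0.5, 1.2]], dtype=np.float32)
bad_classes = np.array([0])
for problem in check_example(normalize_pixels(image), bad_boxes, bad_classes, 3):
    print(problem)
```

Running a check like this over the whole dataset before training is cheap, and a single malformed box or out-of-range label is enough to produce a nan loss at the step where that example is first drawn.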
