I am using SSD512 (ImageNet pre-trained model) and Faster R-CNN (pre-trained). While training, the loss and confidence display nan and the validation score stays at 0.

[Basketball-ChainerCV](https://github.com/atom2k17/Basketball-ChainerCV/blob/master/basketballproject.py)

This is the image for SSD512 training below: SSD512 training image link

When training Faster R-CNN, the following warnings are displayed before the results of the first set of epochs:

    /usr/local/lib/python3.6/dist-packages/chainercv/links/model/faster_rcnn/utils/loc2bbox.py:65: RuntimeWarning: overflow encountered in exp
      h = xp.exp(dh) * src_height[:, xp.newaxis]
    /usr/local/lib/python3.6/dist-packages/chainercv/links/model/faster_rcnn/utils/loc2bbox.py:65: RuntimeWarning: overflow encountered in multiply
      h = xp.exp(dh) * src_height[:, xp.newaxis]
    /usr/local/lib/python3.6/dist-packages/chainercv/links/model/faster_rcnn/utils/loc2bbox.py:66: RuntimeWarning: overflow encountered in exp
      w = xp.exp(dw) * src_width[:, xp.newaxis]
    /usr/local/lib/python3.6/dist-packages/chainercv/links/model/faster_rcnn/utils/loc2bbox.py:66: RuntimeWarning: overflow encountered in multiply
      w = xp.exp(dw) * src_width[:, xp.newaxis]
    /usr/local/lib/python3.6/dist-packages/chainercv/links/model/faster_rcnn/utils/proposal_creator.py:126: RuntimeWarning: invalid value encountered in greater_equal

Faster R-CNN training image link
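
The overflow itself is easy to reproduce outside the library. A minimal numpy sketch (not the ChainerCV source) of how an unusually large predicted offset turns into inf and then nan:

    import numpy as np

    # Minimal sketch (not the ChainerCV source): in float32, exp overflows to inf
    # for a large predicted offset, and inf then propagates nan through later ops,
    # which matches the loc2bbox warnings above.
    dh = np.array([100.0], dtype=np.float32)      # an unusually large predicted offset
    src_height = np.array([32.0], dtype=np.float32)

    h = np.exp(dh) * src_height[:, np.newaxis]    # RuntimeWarning: overflow -> inf
    print(h)        # [[inf]]
    print(h - h)    # [[nan]] -- inf - inf is nan, which then spreads into the loss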

Things I have tried:

  1. Increasing the learning rate
  2. Decreasing the batch_size
  3. Removing images, annotations and text-file entries for images where the bounding box is less than 1% of the total image area (a rough sketch of this filter is shown below)
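
For point 3, this is roughly the kind of filter that was applied (a sketch, not the exact script; it assumes ChainerCV's (y_min, x_min, y_max, x_max) box order and the 1% threshold mentioned above):

    import numpy as np

    # Sketch of the filter in point 3 (not the exact script used): drop boxes whose
    # area is below 1% of the image area.
    # Assumes ChainerCV's (y_min, x_min, y_max, x_max) box order.
    def filter_small_bboxes(bboxes, labels, img_h, img_w, min_frac=0.01):
        bboxes = np.asarray(bboxes, dtype=np.float32)
        labels = np.asarray(labels)
        areas = (bboxes[:, 2] - bboxes[:, 0]) * (bboxes[:, 3] - bboxes[:, 1])
        keep = areas >= min_frac * img_h * img_w
        return bboxes[keep], labels[keep]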

Note: Everything works perfectly fine with SSD300; the issues are only with the SSD512 and Faster R-CNN models. All the models are pre-trained on the ImageNet dataset.
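
To rule out the data, here is a minimal sketch of a sanity check over the preprocessed samples (`dataset` is a placeholder for the project's VOC-style basketball dataset, returning (img, bbox, label) per sample as ChainerCV bbox datasets do):

    import numpy as np

    # Debugging sketch: `dataset` is a placeholder for the project's VOC-style
    # basketball dataset. Flags nan/inf values and boxes that fall outside the image.
    def check_dataset(dataset):
        for i in range(len(dataset)):
            img, bbox, label = dataset[i]
            assert np.isfinite(img).all(), 'nan/inf in image %d' % i
            assert np.isfinite(bbox).all(), 'nan/inf in bbox of image %d' % i
            h, w = img.shape[1:]  # ChainerCV images are CHW float32
            assert (bbox[:, :2] >= 0).all(), 'negative coordinate in image %d' % i
            assert (bbox[:, 2] <= h).all() and (bbox[:, 3] <= w).all(), \
                'box outside image %d' % i
            assert (bbox[:, 2] > bbox[:, 0]).all() and (bbox[:, 3] > bbox[:, 1]).all(), \
                'degenerate box in image %d' % i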

What is the issue (or issues) behind this problem? Can anyone give pointers on how to deal with it?

  • Can you try *decreasing* the learning rate? Increasing the learning rate makes training unstable. I think you can even try setting the learning rate to 0; if nan is detected even in that case, it may be because the input is not correctly preprocessed and its scale differs from what is expected. – corochann Mar 22 '19 at 04:18
  • Setting the learning rate to 0 solves the nan issue, but even after 60 epochs the main loss, loc_loss, conf_loss and validation show almost the same values as in the first epoch, i.e. little change. Maybe the model is not learning? What can be done to overcome this obstacle? – TulakHord Mar 22 '19 at 05:55
  • @corochann Two points I wanted to mention: I used the same data with the SSD300 model, which worked correctly, so I think the data is preprocessed correctly. As for the scale, it follows the ChainerCV convention (y_min, x_min, y_max, x_max). – TulakHord Mar 22 '19 at 07:31
  • What is your current batch_size and the official example's batch_size? If `BatchNormalization` is inside the network, too small a batch_size makes learning unstable. – corochann Mar 24 '19 at 00:45
  • Setting the learning rate to 0 is just for debugging (to see whether the input contains nan or not); the model does not learn anything when the learning rate is 0. You can set a small value (0.01, 0.001 or 0.0001) for the model to learn. – corochann Mar 24 '19 at 00:49

0 Answers