I am using SSD512 (ImageNet pre-trained model) and Faster R-CNN (pre-trained). While training, the loss and confidence display nan and the validation score stays at 0.

[Basketball-ChainerCV](https://github.com/atom2k17/Basketball-ChainerCV/blob/master/basketballproject.py)

This is the image for SSD512 training below: SSD512 training image link

When training Faster R-CNN, the following warnings are displayed before the results of the first set of epochs:

    /usr/local/lib/python3.6/dist-packages/chainercv/links/model/faster_rcnn/utils/loc2bbox.py:65: RuntimeWarning: overflow encountered in exp
      h = xp.exp(dh) * src_height[:, xp.newaxis]
    /usr/local/lib/python3.6/dist-packages/chainercv/links/model/faster_rcnn/utils/loc2bbox.py:65: RuntimeWarning: overflow encountered in multiply
      h = xp.exp(dh) * src_height[:, xp.newaxis]
    /usr/local/lib/python3.6/dist-packages/chainercv/links/model/faster_rcnn/utils/loc2bbox.py:66: RuntimeWarning: overflow encountered in exp
      w = xp.exp(dw) * src_width[:, xp.newaxis]
    /usr/local/lib/python3.6/dist-packages/chainercv/links/model/faster_rcnn/utils/loc2bbox.py:66: RuntimeWarning: overflow encountered in multiply
      w = xp.exp(dw) * src_width[:, xp.newaxis]
    /usr/local/lib/python3.6/dist-packages/chainercv/links/model/faster_rcnn/utils/proposal_creator.py:126: RuntimeWarning: invalid value encountered in greater_equal

Faster R-CNN training image link
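
The overflow itself is easy to reproduce outside the library. A minimal numpy sketch (not the ChainerCV source) of how an unusually large predicted offset turns into inf and then nan:

    import numpy as np

    # Minimal sketch (not the ChainerCV source): in float32, exp overflows to inf
    # for a large predicted offset, and inf then propagates nan through later ops,
    # which matches the loc2bbox warnings above.
    dh = np.array([100.0], dtype=np.float32)      # an unusually large predicted offset
    src_height = np.array([32.0], dtype=np.float32)

    h = np.exp(dh) * src_height[:, np.newaxis]    # RuntimeWarning: overflow -> inf
    print(h)        # [[inf]]
    print(h - h)    # [[nan]] -- inf - inf is nan, which then spreads into the loss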

Things I have tried:

  1. Increasing the learning rate
  2. Decreasing the batch_size
  3. Removing images, annotations and text-file entries for images where the bounding box is less than 1% of the total image area (a rough sketch of this filter is shown below)
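
For point 3, this is roughly the kind of filter that was applied (a sketch, not the exact script; it assumes ChainerCV's (y_min, x_min, y_max, x_max) box order and the 1% threshold mentioned above):

    import numpy as np

    # Sketch of the filter in point 3 (not the exact script used): drop boxes whose
    # area is below 1% of the image area.
    # Assumes ChainerCV's (y_min, x_min, y_max, x_max) box order.
    def filter_small_bboxes(bboxes, labels, img_h, img_w, min_frac=0.01):
        bboxes = np.asarray(bboxes, dtype=np.float32)
        labels = np.asarray(labels)
        areas = (bboxes[:, 2] - bboxes[:, 0]) * (bboxes[:, 3] - bboxes[:, 1])
        keep = areas >= min_frac * img_h * img_w
        return bboxes[keep], labels[keep]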

Note: Everything works perfectly fine with SSD300; the issues are only with the SSD512 and Faster R-CNN models. All the models are pre-trained on the ImageNet dataset.
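
To rule out the data, here is a minimal sketch of a sanity check over the preprocessed samples (`dataset` is a placeholder for the project's VOC-style basketball dataset, returning (img, bbox, label) per sample as ChainerCV bbox datasets do):

    import numpy as np

    # Debugging sketch: `dataset` is a placeholder for the project's VOC-style
    # basketball dataset. Flags nan/inf values and boxes that fall outside the image.
    def check_dataset(dataset):
        for i in range(len(dataset)):
            img, bbox, label = dataset[i]
            assert np.isfinite(img).all(), 'nan/inf in image %d' % i
            assert np.isfinite(bbox).all(), 'nan/inf in bbox of image %d' % i
            h, w = img.shape[1:]  # ChainerCV images are CHW float32
            assert (bbox[:, :2] >= 0).all(), 'negative coordinate in image %d' % i
            assert (bbox[:, 2] <= h).all() and (bbox[:, 3] <= w).all(), \
                'box outside image %d' % i
            assert (bbox[:, 2] > bbox[:, 0]).all() and (bbox[:, 3] > bbox[:, 1]).all(), \
                'degenerate box in image %d' % i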

What is the issue (or issues) behind this problem? Can anyone give pointers on how to deal with it?

  • Can you try *decreasing* the learning rate? Increasing the learning rate makes training unstable. I think you can even try setting the learning rate to 0; if nan is detected even in that case, it may be because the input is not correctly preprocessed and its scale differs from what is expected. – corochann Mar 22 '19 at 04:18
  • Setting the learning rate to 0 solves the nan issue, but even after 60 epochs the main loss, loc_loss, conf_loss and validation show almost the same values as in the first epoch, i.e. little change. Maybe the model is not learning? What can be done to overcome this obstacle? – TulakHord Mar 22 '19 at 05:55
  • @corochann Two points I wanted to mention: I used the same data with the SSD300 model, which worked correctly, so I think the data is preprocessed correctly. As for the scale, it follows the ChainerCV convention (y_min, x_min, y_max, x_max). – TulakHord Mar 22 '19 at 07:31
  • What is your current batch_size and the official example's batch_size? If `BatchNormalization` is inside the network, too small a batch_size makes learning unstable. – corochann Mar 24 '19 at 00:45
  • Setting the learning rate to 0 is just for debugging (to see whether the input contains nan or not); the model does not learn anything when the learning rate is 0. You can set a small value (0.01, 0.001 or 0.0001) for the model to learn. – corochann Mar 24 '19 at 00:49

0 Answers