
I have a trained Caffe net for a two-class problem and want to check the net's output for a single data sample. So I run classification like this:

import caffe

proto = 'deploy.prototxt'
model = 'snapshot_iter_4000.caffemodel'
net = caffe.Net(proto, model, caffe.TEST)

# load the image from the database into "image"
# (a 4-D numpy array of shape N x C x H x W)
out = net.forward_all(data=image)
print out
>> {'prob': array([[ nan,  nan],
    [ nan,  nan]], dtype=float32)}

Looking at the training output, I saw that the accuracy never improves (it stays around 0.48). I have checked all the input LMDBs; none of them contain any NaNs. Moreover, I regularly train several classifiers with the same dataset, and those work as expected.
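
For reference, this is a minimal sketch of how the loaded model's parameters could be checked for NaNs (it assumes the `net` object from the snippet above; if training diverged, the snapshot itself will already contain NaN weights):

import numpy as np

# inspect every learned parameter blob of the loaded snapshot for NaNs
for layer_name, params in net.params.items():
    for i, p in enumerate(params):
        if np.isnan(p.data).any():
            print('NaN found in layer {} (param blob {})'.format(layer_name, i))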

Has anyone encountered this problem? Are there any known numerical instabilities in Caffe?

I would be glad if someone could help me out. Thanks =)

This is the solver.prototxt I used for all nets:

test_iter: 100
test_interval: 100
base_lr: 0.03
display: 50
max_iter: 6000
lr_policy: "step"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
stepsize: 2000
snapshot: 2000
snapshot_prefix: "snapshot"
solver_mode: GPU
net: "train_val.prototxt"
solver_type: SGD

and the net architecture (which is AlexNet):

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 70
  }
  data_param {
    source: "./dataset/train_db"
    batch_size: 300
    backend: LMDB
  }
}
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    crop_size: 70
  }
  data_param {
    source: "./dataset/val_db"
    batch_size: 300
    backend: LMDB
  }
}

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1.0
    decay_mult: 1.0
  }
  param {
    lr_mult: 2.0
    decay_mult: 0.0
  }
  convolution_param {
    num_output: 96
    kernel_size: 11
    stride: 4
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0.0
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1"
}
layer {
  name: "norm1"
  type: "LRN"
  bottom: "conv1"
  top: "norm1"
  lrn_param {
    local_size: 5
    alpha: 0.0001
    beta: 0.75
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "norm1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 1
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param {
    lr_mult: 1.0
    decay_mult: 1.0
  }
  param {
    lr_mult: 2.0
    decay_mult: 0.0
  }
  convolution_param {
    num_output: 256
    pad: 2
    kernel_size: 5
    group: 2
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
layer {
  name: "relu2"
  type: "ReLU"
  bottom: "conv2"
  top: "conv2"
}
layer {
  name: "norm2"
  type: "LRN"
  bottom: "conv2"
  top: "norm2"
  lrn_param {
    local_size: 5
    alpha: 0.0001
    beta: 0.75
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "norm2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "conv3"
  type: "Convolution"
  bottom: "pool2"
  top: "conv3"
  param {
    lr_mult: 1.0
    decay_mult: 1.0
  }
  param {
    lr_mult: 2.0
    decay_mult: 0.0
  }
  convolution_param {
    num_output: 384
    pad: 1
    kernel_size: 3
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0.0
    }
  }
}
layer {
  name: "relu3"
  type: "ReLU"
  bottom: "conv3"
  top: "conv3"
}
layer {
  name: "conv4"
  type: "Convolution"
  bottom: "conv3"
  top: "conv4"
  param {
    lr_mult: 1.0
    decay_mult: 1.0
  }
  param {
    lr_mult: 2.0
    decay_mult: 0.0
  }
  convolution_param {
    num_output: 384
    pad: 1
    kernel_size: 3
    group: 2
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
layer {
  name: "relu4"
  type: "ReLU"
  bottom: "conv4"
  top: "conv4"
}
layer {
  name: "conv5"
  type: "Convolution"
  bottom: "conv4"
  top: "conv5"
  param {
    lr_mult: 1.0
    decay_mult: 1.0
  }
  param {
    lr_mult: 2.0
    decay_mult: 0.0
  }
  convolution_param {
    num_output: 256
    pad: 1
    kernel_size: 3
    group: 2
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
layer {
  name: "relu5"
  type: "ReLU"
  bottom: "conv5"
  top: "conv5"
}
layer {
  name: "pool5"
  type: "Pooling"
  bottom: "conv5"
  top: "pool5"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "fc6"
  type: "InnerProduct"
  bottom: "pool5"
  top: "fc6"
  param {
    lr_mult: 1.0
    decay_mult: 1.0
  }
  param {
    lr_mult: 2.0
    decay_mult: 0.0
  }
  inner_product_param {
    num_output: 4096
    weight_filler {
      type: "gaussian"
      std: 0.005
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
layer {
  name: "relu6"
  type: "ReLU"
  bottom: "fc6"
  top: "fc6"
}
layer {
  name: "drop6"
  type: "Dropout"
  bottom: "fc6"
  top: "fc6"
  dropout_param {
    dropout_ratio: 0.5
  }
}
layer {
  name: "fc7"
  type: "InnerProduct"
  bottom: "fc6"
  top: "fc7"
  param {
    lr_mult: 1.0
    decay_mult: 1.0
  }
  param {
    lr_mult: 2.0
    decay_mult: 0.0
  }
  inner_product_param {
    num_output: 4096
    weight_filler {
      type: "gaussian"
      std: 0.005
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
layer {
  name: "relu7"
  type: "ReLU"
  bottom: "fc7"
  top: "fc7"
}
layer {
  name: "drop7"
  type: "Dropout"
  bottom: "fc7"
  top: "fc7"
  dropout_param {
    dropout_ratio: 0.5
  }
}
layer {
  name: "fc8"
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8"
  param {
    lr_mult: 1.0
    decay_mult: 1.0
  }
  param {
    lr_mult: 2.0
    decay_mult: 0.0
  }
  inner_product_param {
    num_output: 2
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0.0
    }
  }
}


layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc8"
  bottom: "label"
  top: "loss"
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "fc8"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
1 Answer


Update:

Based on the feedback in the comments under my answer, the cause of the NaNs in the question was the following:

The scale of top: "data" in the Data layer was [0, 255], while the initial learning rate base_lr: 0.03 was too large for input on that scale, so training diverged.

Normalizing top: "data" to [0, 1] in the Data layer solved the problem:

transform_param {
    mirror: true
    scale: 0.00390625
    crop_size: 70
}
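
Note that when the data is scaled in the Data layer during training, the same preprocessing has to be applied when running the deploy net by hand, otherwise the deploy-time inputs are on a different scale than what the model was trained on. A minimal sketch in pycaffe (the `image` and `net` variables refer to the question's snippet; 0.00390625 = 1/256):

import numpy as np

# apply the same factor as transform_param's scale: 0.00390625 (= 1/256)
# so the deploy-time input range matches the training-time range
image = image.astype(np.float32) * 0.00390625
out = net.forward_all(data=image)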

The NaNs most likely indicate training divergence in your case, i.e. your training was not converging (your ~0.48 accuracy for a two-class problem points to the same thing). Since your input LMDBs have worked before, the most likely reason is a learning rate that is too large: it updates the model parameters so aggressively during training that they blow up into NaNs. So you may simply try a smaller learning rate, for example 10 times smaller, until the training works. A sketch of that change is shown below.
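
For illustration, a 10x smaller learning rate in the solver from the question would simply be (all other solver settings stay as posted):

# solver.prototxt: only base_lr changes
base_lr: 0.003
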
Besides, the thread linked by @Shai in the comments above is also very helpful.

  • Sorry, I was imprecise. I used the exact same training parameters (solver.prototxt) with the same dataset, so the learning rate (which was in fact 0.03) should not be the problem. Also, I do not use a custom loss layer, and I have checked the inputs for NaNs. What I discovered is that for roughly 1 in 50 classifiers I get an accuracy of ~50% even without any NaNs. So I thought that maybe, since I use stochastic gradient descent, the random initialization was the problem. What do you think? – T_W Jun 28 '16 at 12:07
  • Is the net.prototxt also the same? – Dale Jun 28 '16 at 12:11
  • Yes, I use the same Caffe setup for all the classifiers. – T_W Jun 28 '16 at 12:13
  • Can you upload your net.prototxt and solver.prototxt? – Dale Jun 28 '16 at 12:22
  • Do you mean that you had trained exactly the same network (solver/net.prototxt unchanged) several times successfully, but only this time it failed? – Dale Jun 28 '16 at 12:33
  • @T_W As far as I know, if you use ReLU as the activation function, the Xavier random initialization is less likely to be the problem (~50% accuracy without NaNs) than the Gaussian method. With sigmoid, however, both methods can cause this problem, because it saturates easily even without NaNs being observed. So I'd advise trying a moderately smaller learning rate, for example 0.001. – Dale Jun 28 '16 at 13:23
  • Yes, I train exactly the same net several times, and yes, I use ReLUs (I actually use the AlexNet structure). I have added the solver/net prototxt to my original question. – T_W Jun 29 '16 at 08:37
  • I notice that you don't perform any data transform such as subtracting the image mean or normalizing the image data to [0, 1]. So I guess the input data scale may be the problem. @T_W – Dale Jun 29 '16 at 10:16
  • No, so far I use data in the range [0, 255]. I'll scale it and let you know if the problem still occurs. (This will probably take a while to figure out, as I have to train many nets.) – T_W Jun 30 '16 at 08:32
  • Sorry, I forgot to reply. Yes, the problem does not occur any more. Thanks! – T_W Sep 12 '16 at 09:32
  • Maybe you can update your question with the solution. : ) @T_W – Dale Sep 12 '16 at 12:34