
Hi all,

I ran into problems when using batch normalization in Caffe. Here is the code I used in train_val.prototxt:

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "conv0"
  top: "conv1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 32
    pad: 1
    kernel_size: 3
    weight_filler {
      type: "gaussian"
      std: 0.0589
    }
    bias_filler {
      type: "constant"
      value: 0
    }
    engine: CUDNN
  }
}
layer {
  name: "bnorm1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  batch_norm_param {
    use_global_stats: false
  }
}
layer {
  name: "scale1"
  type: "Scale"
  bottom: "conv1"
  top: "conv1"
  scale_param {
    bias_term: true
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1"
}

layer {
  name: "conv16"
  type: "Convolution"
  bottom: "conv1"
  top: "conv16"
  param {
    lr_mult: 1
    decay_mult: 1
  }

However, the training does not converge. When I remove the BN layers (batchnorm + scale), the training converges. So I started comparing the log files with and without the BN layers. Here are the log files with debug_info enabled:
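(For reference, the per-blob statistics below come from turning on debug info in the solver; a minimal sketch, assuming the solver points at the train_val.prototxt above:)

# solver.prototxt (sketch): debug_info prints the per-blob
# [Forward]/[Backward] statistics shown in the logs below
net: "train_val.prototxt"
debug_info: true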

With BN:

I0804 10:22:42.074671  8318 net.cpp:638]     [Forward] Layer loadtestdata, top blob data data: 0.368457
I0804 10:22:42.074757  8318 net.cpp:638]     [Forward] Layer loadtestdata, top blob label data: 0.514496
I0804 10:22:42.076117  8318 net.cpp:638]     [Forward] Layer conv0, top blob conv0 data: 0.115678
I0804 10:22:42.076200  8318 net.cpp:650]     [Forward] Layer conv0, param blob 0 data: 0.0455077
I0804 10:22:42.076273  8318 net.cpp:650]     [Forward] Layer conv0, param blob 1 data: 0
I0804 10:22:42.076539  8318 net.cpp:638]     [Forward] Layer relu0, top blob conv0 data: 0.0446758
I0804 10:22:42.078435  8318 net.cpp:638]     [Forward] Layer conv1, top blob conv1 data: 0.0675479
I0804 10:22:42.078516  8318 net.cpp:650]     [Forward] Layer conv1, param blob 0 data: 0.0470226
I0804 10:22:42.078589  8318 net.cpp:650]     [Forward] Layer conv1, param blob 1 data: 0
I0804 10:22:42.079108  8318 net.cpp:638]     [Forward] Layer bnorm1, top blob conv1 data: 0
I0804 10:22:42.079197  8318 net.cpp:650]     [Forward] Layer bnorm1, param blob 0 data: 0
I0804 10:22:42.079270  8318 net.cpp:650]     [Forward] Layer bnorm1, param blob 1 data: 0
I0804 10:22:42.079350  8318 net.cpp:650]     [Forward] Layer bnorm1, param blob 2 data: 0
I0804 10:22:42.079421  8318 net.cpp:650]     [Forward] Layer bnorm1, param blob 3 data: 0
I0804 10:22:42.079505  8318 net.cpp:650]     [Forward] Layer bnorm1, param blob 4 data: 0
I0804 10:22:42.080267  8318 net.cpp:638]     [Forward] Layer scale1, top blob conv1 data: 0
I0804 10:22:42.080345  8318 net.cpp:650]     [Forward] Layer scale1, param blob 0 data: 1
I0804 10:22:42.080418  8318 net.cpp:650]     [Forward] Layer scale1, param blob 1 data: 0
I0804 10:22:42.080651  8318 net.cpp:638]     [Forward] Layer relu1, top blob conv1 data: 0
I0804 10:22:42.082074  8318 net.cpp:638]     [Forward] Layer conv16, top blob conv16 data: 0
I0804 10:22:42.082154  8318 net.cpp:650]     [Forward] Layer conv16, param blob 0 data: 0.0485365
I0804 10:22:42.082226  8318 net.cpp:650]     [Forward] Layer conv16, param blob 1 data: 0
I0804 10:22:42.082675  8318 net.cpp:638]     [Forward] Layer loss, top blob loss data: 42.0327

Without BN:

I0803 17:01:29.700850 30274 net.cpp:638]     [Forward] Layer loadtestdata, top blob data data: 0.320584
I0803 17:01:29.700920 30274 net.cpp:638]     [Forward] Layer loadtestdata, top blob label data: 0.236383
I0803 17:01:29.701556 30274 net.cpp:638]     [Forward] Layer conv0, top blob conv0 data: 0.106141
I0803 17:01:29.701633 30274 net.cpp:650]     [Forward] Layer conv0, param blob 0 data: 0.0467062
I0803 17:01:29.701692 30274 net.cpp:650]     [Forward] Layer conv0, param blob 1 data: 0
I0803 17:01:29.701835 30274 net.cpp:638]     [Forward] Layer relu0, top blob conv0 data: 0.0547961
I0803 17:01:29.702193 30274 net.cpp:638]     [Forward] Layer conv1, top blob conv1 data: 0.0716117
I0803 17:01:29.702267 30274 net.cpp:650]     [Forward] Layer conv1, param blob 0 data: 0.0473551
I0803 17:01:29.702327 30274 net.cpp:650]     [Forward] Layer conv1, param blob 1 data: 0
I0803 17:01:29.702425 30274 net.cpp:638]     [Forward] Layer relu1, top blob conv1 data: 0.0318472
I0803 17:01:29.702781 30274 net.cpp:638]     [Forward] Layer conv16, top blob conv16 data: 0.0403702
I0803 17:01:29.702847 30274 net.cpp:650]     [Forward] Layer conv16, param blob 0 data: 0.0474007
I0803 17:01:29.702908 30274 net.cpp:650]     [Forward] Layer conv16, param blob 1 data: 0
I0803 17:01:29.703228 30274 net.cpp:638]     [Forward] Layer loss, top blob loss data: 11.2245

It is strange that in the forward pass, every layer starting with the batchnorm gives 0! It is also worth mentioning that ReLU (an in-place layer) has only 4 lines, but batchnorm and scale (supposedly also in-place layers) have 6 and 3 lines in the log file. Do you know what the problem is?

1 Answer

I don't know what is wrong with your "BatchNorm" layer, but it is very odd:
According to your debug log, your "BatchNorm" layer has 5 (!) internal param blobs (0..4). Looking at the source code of batch_norm_layer.cpp, there should be only 3 internal param blobs:

this->blobs_.resize(3);

I suggest you make sure the implementation of "BatchNorm" you are using is not buggy.
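For comparison, with the stock three-blob implementation the "BatchNorm" layer is usually written along these lines (a sketch of the common pattern, not taken from your prototxt; the three internal blobs hold the running mean, running variance and the moving-average factor, and are normally frozen for the solver):

layer {
  name: "bnorm1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  # the three internal blobs are updated via running averages,
  # not by gradient descent, so learning rate and decay are zeroed
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
}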


Regarding the debug log, you can read more here about how to interpret it.
To address your question

"Relu [...] have only 4 lines, but batchnorm and scale [...] have 6 and 3 lines in log file"

Note that each layer has one line for "top blob ... data", reporting the L2 norm of the output blob.
Additionally, each layer has an extra line for each of its internal param blobs. A "ReLU" layer has no internal parameters, and thus no "param blob [...] data" prints. A "Convolution" layer has two internal params (kernels and bias), and thus two extra lines, for param blob 0 and param blob 1.
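For example, that is exactly what you see in your log: "bnorm1" prints 1 + 5 = 6 lines (one top blob plus five param blobs), "scale1" prints 1 + 2 = 3 lines (scale plus bias), and each "ReLU" prints just the single top-blob line. A stock three-blob "BatchNorm" would print only 1 + 3 = 4 lines.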

Shai