I am using a BatchNorm layer. I understand the meaning of use_global_stats, which is usually set to false for training and true for testing/deploy. This is my setting for the testing phase.

layer {
  name: "bnorm1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "bnorm1"
  batch_norm_param {
    use_global_stats: true
  }
}
layer {
  name: "scale1"
  type: "Scale"
  bottom: "bnorm1"
  top: "bnorm1"
  scale_param {
    bias_term: true
    filler {
      value: 1
    }    
    bias_filler {
      value: 0.0
    }
  }
}
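
In the training phase the only change is use_global_stats; for example, the BatchNorm layer looks like this (a minimal sketch, same layer names as above):

layer {
  name: "bnorm1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "bnorm1"
  batch_norm_param {
    use_global_stats: false   # use mini-batch statistics while training
  }
}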

In solver.prototxt I used the Adam method (a rough sketch of my solver settings is below). I found an interesting problem in my case. If I choose base_lr: 1e-3, I get good performance when I set use_global_stats: false in the testing phase. However, if I choose base_lr: 1e-4, I get good performance when I set use_global_stats: true in the testing phase. Does this mean that base_lr affects the BatchNorm setting (even though I used the Adam method)? Could you suggest any reason for that? Thanks all
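
Here are my solver settings, roughly; only type and base_lr matter for this question, the other values are placeholders:

net: "train_val.prototxt"   # placeholder path
type: "Adam"
base_lr: 1e-3               # or 1e-4; this is the value I vary
momentum: 0.9               # Adam beta1
momentum2: 0.999            # Adam beta2
delta: 1e-8                 # Adam epsilon
lr_policy: "fixed"
max_iter: 100000            # placeholder
snapshot_prefix: "model"    # placeholder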

KimHee

1 Answer


AFAIK the learning rate does not directly affect the learned parameters of the "BatchNorm" layer. Indeed, caffe forces lr_mult for all internal parameters of this layer to zero, regardless of base_lr or the type of the solver.
However, you might encounter a case where the adjacent layers converge to different points depending on the base_lr you are using, and indirectly this causes the "BatchNorm" to behave differently.
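
For illustration, those forced-zero learning rates are equivalent to writing the layer out explicitly like this (caffe applies this internally even if the param blocks are omitted):

layer {
  name: "bnorm1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "bnorm1"
  # the three internal blobs (running mean, running variance, moving-average
  # factor) are statistics, not learned weights, so their lr_mult is zero
  param { lr_mult: 0 }
  param { lr_mult: 0 }
  param { lr_mult: 0 }
  batch_norm_param {
    use_global_stats: true
  }
}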

Shai
  • Thanks for your answer. `the adjacent layers converge to different points` means a local optimum, is that right? If so, maybe the learning rate is too small – KimHee May 29 '17 at 15:00
  • @KimHee usually with `"Adam"` solver one tends to set `base_lr` a bit higher relative to other solvers. – Shai May 29 '17 at 15:01
  • Do you think Adam solver is always better for all networks in comparison with SGD? – KimHee May 29 '17 at 15:10
  • @KimHee I don't think I'm qualified to answer such a question. – Shai May 29 '17 at 15:11