26

Looking at an example 'solver.prototxt', posted on BVLC/caffe git, there is a training meta parameter

weight_decay: 0.04

What does this meta parameter mean? And what value should I assign to it?

rayryeng
Shai

2 Answers

47

The weight_decay meta parameter governs the regularization term of the neural net.

During training a regularization term is added to the network's loss to compute the backprop gradient. The weight_decay value determines how dominant this regularization term will be in the gradient computation.
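To make this concrete, here is a minimal sketch of a single plain SGD step with the regularization gradient added to the loss gradient. It is not Caffe's actual code: the function name `sgd_step` and its arguments are hypothetical, and momentum and per-layer multipliers are ignored. It assumes the L2 term contributes `weight_decay * w` to the gradient and the L1 term contributes `weight_decay * sign(w)`.

```python
def sgd_step(weights, grads, lr, weight_decay, regularization_type="L2"):
    """One simplified SGD step (hypothetical helper, not Caffe's API).

    weights, grads: lists of floats (parameters and loss gradients).
    The regularization gradient is added to the loss gradient before the update.
    """
    updated = []
    for w, g in zip(weights, grads):
        if regularization_type == "L2":
            # Gradient of 0.5 * weight_decay * w**2
            reg_grad = weight_decay * w
        elif regularization_type == "L1":
            # Gradient of weight_decay * |w|, i.e. weight_decay * sign(w)
            reg_grad = weight_decay * ((w > 0) - (w < 0))
        else:
            raise ValueError("unknown regularization_type: " + regularization_type)
        updated.append(w - lr * (g + reg_grad))
    return updated


# With a zero loss gradient, only the regularization term moves the weights,
# pulling each of them toward zero.
print(sgd_step([1.0, -2.0], [0.0, 0.0], lr=0.1, weight_decay=0.5))
```

A larger `weight_decay` makes `reg_grad` a larger share of the total gradient, which is what "more dominant" means here.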

As a rule of thumb, the more training examples you have, the weaker this term should be. The more parameters you have (e.g., deeper net, larger filters, larger InnerProduct layers), the stronger this term should be.

Caffe also allows you to choose between L2 regularization (default) and L1 regularization, by setting

regularization_type: "L1"

However, since in most cases weights are small numbers (i.e., -1<w<1), the L2 penalty (which sums the squared weights) is significantly smaller than the L1 penalty (which sums their absolute values). Thus, if you choose to use regularization_type: "L1" you might need to tune weight_decay to a significantly smaller value.
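A quick numeric check of this point, as a sketch only: the weight values below are made up for illustration, the helper name `penalties` is hypothetical, and constant factors (such as a possible 1/2 in front of the L2 term) are ignored.

```python
def penalties(weights):
    """Return (L1 penalty, L2 penalty) for a list of weights.

    L1 sums absolute values; L2 sums squares. Constant factors are omitted.
    """
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return l1, l2


# Illustrative weights, all with magnitude below 1.
example_weights = [0.5, -0.1, 0.02, -0.3]
l1, l2 = penalties(example_weights)
print("L1 penalty:", l1)
print("L2 penalty:", l2)
print("ratio L1 / L2:", l1 / l2)
```

Because squaring a number with magnitude below 1 shrinks it, the same `weight_decay` value produces a noticeably stronger penalty under L1 than under L2 for weights like these.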

While learning rate may (and usually does) change during training, the regularization weight is fixed throughout.

Graham
Shai
  • Maybe you could explain the reasons behind your rule of thumb? Do you have a source for that? – Janosch Aug 25 '15 at 11:30
  • 11
    @Janosch usually one needs to use regularization when there are more parameters than constraints on a numerical problem. In learning, training examples represent "constraints". So, if you have (much) more training examples than free parameters, you need to worry less about overfitting and you can reduce the regularization term. However, if you have very few training examples (compared to the number of parameters), then your model is prone to overfitting and you need a strong regularization term to prevent this from happening – Shai Aug 25 '15 at 11:33
  • Do you have to set `param { lr_mult: 1 decay_mult: 1 }` in the `convolution` layer or is the regularization type global? @Shai –  Nov 29 '16 at 17:25
  • the global `weight_decay` multiplies the parameter-specific `decay_mult`. @thigi – Shai Nov 29 '16 at 17:29
  • That means when the parameter is `unset` or `zero` this does not have any influence? Which one is the default `regularization` ? And which one would you preferably use? My problem is overfitting -> train loss lower than test loss @Shai –  Nov 29 '16 at 17:31
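The multiplication described in the comments above can be sketched as follows. This is an illustration, not Caffe code: `effective_decay` is a hypothetical helper, and the default of 1.0 for `decay_mult` when it is left unset is an assumption here that should be checked against your Caffe version's caffe.proto.

```python
def effective_decay(weight_decay, decay_mult=1.0):
    """Per-parameter decay: global weight_decay times the parameter's decay_mult.

    decay_mult=1.0 as the default is an assumption; verify it in caffe.proto.
    """
    return weight_decay * decay_mult


global_weight_decay = 0.04  # value from the question's solver.prototxt

# A parameter with decay_mult: 1 receives the full global decay.
print(effective_decay(global_weight_decay, decay_mult=1.0))

# A parameter with decay_mult: 0 is not regularized at all.
print(effective_decay(global_weight_decay, decay_mult=0.0))
```

Under this reading, setting `decay_mult: 0` on a parameter switches regularization off for that parameter only, while the regularization type itself stays a solver-level setting.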
16

Weight decay is a regularization term that penalizes big weights. When the weight decay coefficient is big, the penalty for big weights is also big; when it is small, weights can grow freely.

Look at this answer (not specific to caffe) for a better explanation: Difference between neural net "weight decay" and "learning rate".

Community
Tal Darom