From some old discussions (link1, link2) I got the idea that the 'weight_decay' parameter is the regularization parameter for an L2 loss over the weights. For example, in the cifar10 solver the weight_decay value is 0.004. Does this mean the loss to be minimized is "cross-entropy + 0.004*sum_of_L2_Norm_of_all_weights"? Or is it, by any chance, "cross-entropy + 0.004/2*sum_of_L2_Norm_of_all_weights"?
1 Answer
The loss seems to be "cross-entropy + 0.004/2*sum_of_L2_Norm_of_all_weights".
Looking at the official Caffe implementation of AlexNet, the solver file (https://github.com/BVLC/caffe/blob/master/models/bvlc_alexnet/solver.prototxt) sets weight_decay=0.0005, while in the original AlexNet paper (http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf, page 6) the gradient update includes the term
-0.0005*e*w_i (where e is the learning rate)
Since the gradient is the partial derivative of the loss, and the regularization component of the loss is usually expressed as lambda*||w||^2, its contribution to the gradient with respect to w_i is 2*lambda*w_i, which shows up in the update as -e*2*lambda*w_i. Comparing this with the -0.0005*e*w_i term above, it seems as if
weight_decay = 2*lambda
i.e. lambda = weight_decay/2.
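As a sanity check, here is a minimal NumPy sketch (my own, not Caffe code) comparing a plain SGD step that applies weight decay wd directly to the gradient against a step on a loss that contains an explicit (wd/2)*||w||^2 term; the two updates coincide, which is consistent with weight_decay = 2*lambda. Momentum is left out for simplicity, and all variable names are just for illustration.

```python
import numpy as np

np.random.seed(0)
w = np.random.randn(5)          # some weights
grad_data = np.random.randn(5)  # stand-in for the gradient of the cross-entropy term
lr, wd = 0.01, 0.004            # learning rate and weight_decay (cifar10 solver value)

# Update 1: explicit weight decay added to the gradient, as in the AlexNet-style update
w_decay = w - lr * (grad_data + wd * w)

# Update 2: gradient of loss = data_loss + (wd/2)*||w||^2,
# whose gradient w.r.t. w is grad_data + 2*(wd/2)*w = grad_data + wd*w
lam = wd / 2.0
w_l2 = w - lr * (grad_data + 2 * lam * w)

print(np.allclose(w_decay, w_l2))  # True -> consistent with weight_decay = 2*lambda
```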
