
I am using the nnet function in R to train my neural network. I don't understand what the decay parameter in nnet is. Is it the step size used in the gradient descent method, or a regularization parameter used to overcome overfitting?

user395882

2 Answers


It's regularization to avoid over-fitting.

From the documentation (pdf):

decay: parameter for weight decay. Default 0.

Further information is available in the authors' book, Modern Applied Statistics with S, Fourth Edition, page 245:

One way to ensure that f is smooth is to restrict the class of estimates, for example, by using a limited number of spline knots. Another way is regularization in which the fit criterion is altered to

E + λC(f)

with a penalty C on the ‘roughness’ of f. Weight decay, specific to neural networks, uses as penalty the sum of squares of the weights w_ij. ... The use of weight decay seems both to help the optimization process and to avoid over-fitting. (emphasis added)
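For concreteness, a minimal usage sketch; the iris data and the particular decay values below are illustrative assumptions, not taken from the book or the nnet documentation:

library(nnet)
set.seed(1)
# decay is passed straight to nnet() as the weight-decay (L2) penalty
fit_no_decay <- nnet(Species ~ ., data = iris, size = 5,
                     decay = 0, maxit = 200, trace = FALSE)
fit_decay <- nnet(Species ~ ., data = iris, size = 5,
                  decay = 1e-2, maxit = 200, trace = FALSE)
# the penalized fit should end up with smaller weights overall
sum(fit_no_decay$wts^2)
sum(fit_decay$wts^2)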

blahdiblah
  • If that were true, setting decay = 0 should result in an overfitted model (with the best possible training set accuracy). In my case, instead, I got a very bad training set accuracy (about 10%). Giving decay = 1e-4 -> .8294, decay = 2e-4 -> .8832, 5e-3 -> .9924, 1e-2 -> .9954, 1e-1 -> .9966, 1 -> .9644 (a sweep of this kind is sketched after these comments). So I think decay must be a parameter that decreases the learning rate of the optimization function – gd047 Mar 06 '12 at 06:50
  • If you really wanted to be sure, you could look [at the source](http://cran.r-project.org/src/contrib/nnet_7.3-1.tar.gz). The whole thing is less than 700 lines, and with an obvious eye towards comprehensibility. I haven't been awash in neural nets recently enough to follow it readily, but maybe you'll find it easier. – blahdiblah Mar 06 '12 at 08:53
  • http://stats.stackexchange.com/questions/29130/difference-between-neural-net-weight-decay-and-learning-rate – Fernando Nov 21 '14 at 14:51
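The sweep mentioned in the first comment can be reproduced in spirit with something like the sketch below; gd047's data set and network settings are not given, so iris is used purely as a stand-in and the accuracies will differ:

library(nnet)
set.seed(1)
decays <- c(0, 1e-4, 2e-4, 5e-3, 1e-2, 1e-1, 1)
train_acc <- sapply(decays, function(d) {
  fit <- nnet(Species ~ ., data = iris, size = 5, decay = d,
              maxit = 200, trace = FALSE)
  # training-set accuracy for this decay value
  mean(predict(fit, iris, type = "class") == iris$Species)
})
data.frame(decay = decays, train_acc = train_acc)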

Complementing blahdiblah's answer: looking at the source code, I think the parameter weights corresponds to the learning rate of back-propagation (by reading the manual I couldn't understand what it was). Look at the file nnet.c, line 236, inside the function fpass:

TotalError += wx * E(Outputs[i], goal[i - FirstOutput]);

Here, in a very intuitive nomenclature, E corresponds to the back-propagation error and wx is a parameter passed to the function, which ultimately corresponds to the identifier Weights[i].

Also, you can be sure that the parameter decay is indeed what it claims to be by going to lines 317-319 of the same file, inside the function VR_dfunc:

for (i = 0; i < Nweights; i++)
    sum1 += Decay[i] * p[i] * p[i];
*fp = TotalError + sum1;

where p corresponds to the connections' weights, which is exactly the definition of weight-decay regularization.
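As a sketch, the same quantity can be recomputed from R after fitting; the iris data and the scalar lambda here are assumptions for illustration, and per ?nnet the returned value component is the fitting criterion plus the weight-decay term:

library(nnet)
set.seed(1)
lambda <- 1e-2
fit <- nnet(Species ~ ., data = iris, size = 5, decay = lambda,
            maxit = 200, trace = FALSE)
# mirrors the C loop above: Decay[i] * p[i] * p[i] summed over all weights;
# with a scalar decay this is lambda times the sum of squared connection weights
penalty <- lambda * sum(fit$wts^2)
penalty
# fit$value already includes this penalty in the reported fitting criterion
fit$value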

AronNeewart
  • Thanks for your helpful answer. I'm trying to set up a sequential model in Keras to mimic the behavior of nnet but they don't seem to match. The "decay" argument in nnet seems to be equivalent to the L2 regularization parameter in Keras, but some of the code in nnet.c is confusing. First question: why is "Decay" indexed at all? Isn't it just a fixed value used to multiply the sum of squared weights? Second question: what is the "slopes" object and why is it multiplied by "Decay"? Third question: does the decay apply to all weights or just the connections between the input and the hidden layer? – Josh Oct 01 '16 at 17:31