16

In my dataset I have a number of continuous and dummy variables. For analysis with glmnet, I want the continuous variables to be standardized but not the dummy variables.

I currently do this manually by first defining a dummy vector of columns that have only values of [0,1] and then using the scale command on all the non-dummy columns. Problem is, this isn't very elegant.
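For concreteness, the manual approach described above might look roughly like the following. This is only a sketch on made-up data; the column names and the `is_dummy` helper vector are illustrative assumptions, not part of any package.

```r
# Toy data: two continuous columns and one 0/1 dummy (names are illustrative).
set.seed(1)
X <- cbind(
  age    = rnorm(10, mean = 40, sd = 10),
  income = rnorm(10, mean = 50000, sd = 8000),
  female = rep(c(0, 1), times = 5)
)

# Treat a column as a dummy if every value is 0 or 1.
is_dummy <- apply(X, 2, function(col) all(col %in% c(0, 1)))

# Standardize only the non-dummy columns.
# scale() centers each column and divides by its sample SD (n - 1 denominator).
X_scaled <- X
X_scaled[, !is_dummy] <- scale(X[, !is_dummy])
```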

But glmnet has a built-in standardize argument. By default, will this standardize the dummies too? If so, is there an elegant way to tell glmnet's standardize argument to skip dummies?

Dr. Beeblebrox
  • Why are you doing all that extra work? – IRTFM Jul 26 '13 at 17:54
  • @DWin I don't see another way. If glmnet doesn't discriminate, then I need to. As I just posted below, if we can't interpret a coefficient on a standardized dummy variable, then I need to separate dummies from non-dummies before standardizing. – Dr. Beeblebrox Aug 05 '13 at 18:28

2 Answers

13

In short, yes: this will standardize the dummy variables, but there's a reason for doing so. The glmnet function takes a matrix as the input for its x parameter, not a data frame, so it cannot tell dummy columns apart from continuous ones the way it could if it received a data.frame with factor columns. If you take a look at the R function, glmnet codes the standardize parameter internally as

    isd = as.integer(standardize)

This converts the R boolean to a 0 or 1 integer that is passed to the internal FORTRAN functions (elnet, lognet, et al.).

If you go even further by examining the FORTRAN code (fixed width - old school!), you'll see the following block:

          subroutine standard1 (no,ni,x,y,w,isd,intr,ju,xm,xs,ym,ys,xv,jerr)    989
          real x(no,ni),y(no),w(no),xm(ni),xs(ni),xv(ni)                        989
          integer ju(ni)                                                        990
          real, dimension (:), allocatable :: v                                     
          allocate(v(1:no),stat=jerr)                                           993
          if(jerr.ne.0) return                                                  994
          w=w/sum(w)                                                            994
          v=sqrt(w)                                                             995
          if(intr .ne. 0)goto 10651                                             995
          ym=0.0                                                                995
          y=v*y                                                                 996
          ys=sqrt(dot_product(y,y)-dot_product(v,y)**2)                         996
          y=y/ys                                                                997
    10660 do 10661 j=1,ni                                                       997
          if(ju(j).eq.0)goto 10661                                              997
          xm(j)=0.0                                                             997
          x(:,j)=v*x(:,j)                                                       998
          xv(j)=dot_product(x(:,j),x(:,j))                                      999
          if(isd .eq. 0)goto 10681                                              999
          xbq=dot_product(v,x(:,j))**2                                          999
          vc=xv(j)-xbq                                                         1000
          xs(j)=sqrt(vc)                                                       1000
          x(:,j)=x(:,j)/xs(j)                                                  1000
          xv(j)=1.0+xbq/vc                                                     1001
          goto 10691                                                           1002

Take a look at the lines marked 1000: they compute the (weighted) standard deviation xs(j) of column j and divide the column by it, which is the scaling step of standardization applied to the X matrix.
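To make the arithmetic easier to follow, here is a rough R transcription of the lines shown for a single column. It is a sketch of this excerpt only (the branch reached when `intr` is 0, where the column is scaled but `xm(j)` is set to 0), and `standardize_column` is a hypothetical helper name, not a glmnet function.

```r
# Mirrors the Fortran excerpt for one column with isd = 1.
# x: numeric column; w: non-negative observation weights (default: equal weights).
standardize_column <- function(x, w = rep(1, length(x))) {
  w   <- w / sum(w)      # w = w/sum(w)
  v   <- sqrt(w)         # v = sqrt(w)
  xw  <- v * x           # x(:,j) = v*x(:,j)
  xv  <- sum(xw * xw)    # weighted mean of x^2
  xbq <- sum(v * xw)^2   # squared weighted mean of x
  vc  <- xv - xbq        # weighted variance (divides by n, not n - 1)
  list(xs = sqrt(vc), x = xw / sqrt(vc))
}

x   <- c(0, 0, 1, 1, 1)  # a dummy with proportion p = 0.6 of ones
out <- standardize_column(x)
out$xs                   # for a 0/1 column this equals sqrt(p * (1 - p))
```

With equal weights, `xs` is the standard deviation with an n denominator, so it differs from R's `sd()` by a factor of `sqrt((n - 1) / n)`.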

Now, statistically speaking, one generally does not standardize categorical variables, so as to retain the interpretability of the estimated coefficients. However, as pointed out by Tibshirani here, "The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables". So while this causes arbitrary scaling between continuous and categorical variables, it is done so that all regressors are penalized equally.

R_User
  • I did some [similar digging](https://thinklab.com/discussion/computing-standardized-logistic-regression-coefficients/205#5) to confirm the way glmnet re-transforms the coefficients after fitting on the standardized variables. Funtran :-) – Antoine Lizée May 08 '16 at 00:08
  • from `glmnet`'s help: "The coefficients are always returned on the original scale". So, interpretability of the coefficients should not be an issue. – pbahr Mar 17 '17 at 15:45
  • While the coefficients are "on the original scale", L1 and L2 penalization inherently bias the regression coefficients in order to reduce variance (see [Bias-Variance Tradeoff](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff)), meaning that they shouldn't be treated as unbiased estimates of the effect on the dependent variable. Just a clarification :) – R_User Oct 03 '17 at 21:29
3

glmnet doesn't know anything about dummy variables, because it doesn't have a formula interface (and hence doesn't touch model.frame and model.matrix). If you want them to be treated specially, you'll have to do it yourself.
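One way to "do it yourself" is to scale the continuous columns by hand and then switch off glmnet's own scaling with `standardize = FALSE`. The following is a minimal sketch on simulated data; the column names, the simulated coefficients, and the choice of `s = 0.1` are arbitrary assumptions made for illustration.

```r
# Sketch only: the data and column names are made up.
set.seed(42)
n <- 200
X <- cbind(
  x1 = rnorm(n, mean = 10, sd = 5),      # continuous
  x2 = runif(n, min = 0, max = 100),     # continuous
  d1 = rbinom(n, size = 1, prob = 0.3)   # 0/1 dummy
)
y <- 0.5 * X[, "x1"] - 0.02 * X[, "x2"] + 2 * X[, "d1"] + rnorm(n)

# Scale the continuous columns by hand and leave the dummy as 0/1.
is_dummy <- apply(X, 2, function(col) all(col %in% c(0, 1)))
X_prep <- X
X_prep[, !is_dummy] <- scale(X[, !is_dummy])

if (requireNamespace("glmnet", quietly = TRUE)) {
  # standardize = FALSE stops glmnet from rescaling any column, dummies included.
  fit <- glmnet::glmnet(X_prep, y, alpha = 1, standardize = FALSE)
  print(coef(fit, s = 0.1))  # coefficients are on the scale of X_prep
}
```

Keep in mind that the penalty then acts on the dummy at its raw 0/1 scale, so it is no longer penalized on the same footing as the scaled continuous columns.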

Hong Ooi
  • Is it OK to let the dummies be standardized? – Dr. Beeblebrox Jul 26 '13 at 19:20
  • Answering my own question above. **No, it is not OK to standardize dummies.** Quoting http://www.sagepub.com/upm-data/21120_Chapter_7.pdf, page 140: "an unstandardized coefficient for a dummy regressor is interpretable as the expected response-variable difference between a particular category and the baseline category for the dummy-regressor set (controlling, of course, for the other explanatory variables in the model). If a dummy-regressor coefficient is standardized, then this straightforward interpretation is lost." – Dr. Beeblebrox Aug 05 '13 at 18:26
  • @R_User Do you want to add an answer based on your comment? I could then accept yours as the answer. – Dr. Beeblebrox Oct 23 '14 at 15:13