
I am running lasso regression on a large data set (n = 1918, p = 85), and the variables the regression identifies as important turn out to explain very little when actually put into a linear model (R-squared of about 0.017). At the other end, lasso gives coefficients near 0 to variables that are highly significant in a linear model, and does not select them. The data frame going into LARS is already scaled. Any ideas on why this might occur? Below is an example of what LARS might choose, followed by a model I built from genuinely good explanatory variables using the exact same data set.

UPDATE: I'm noticing that lasso is choosing all of my temperature variables and assigning them relatively high coefficients (>1), while all the rest of the variables fall between 0 and 1. Not sure why this is occurring.
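One thing that may be worth checking here is whether the temperature variables are strongly correlated with each other, since highly collinear predictors can receive large, offsetting coefficients in a weakly penalized fit. The sketch below uses simulated data, not the real data set: `df_demo`, the column `SU.MTMEAN`, and the `"MT"` name pattern are assumptions made only for illustration.

```r
# Simulated stand-in for the real data frame (all values are made up).
set.seed(42)
n <- 500
temp <- rnorm(n)  # shared temperature signal

df_demo <- data.frame(
  SP.MTMEAN = temp + rnorm(n, sd = 0.1),  # spring mean temperature
  SU.MTMEAN = temp + rnorm(n, sd = 0.1),  # hypothetical second temperature column
  SU.MPPT   = rnorm(n)                    # unrelated precipitation column
)

# Scale every column to mean 0 and standard deviation 1.
df_demo <- as.data.frame(scale(df_demo))

# Confirm the scaling actually took effect.
col_sds <- sapply(df_demo, sd)
print(round(col_sds, 3))

# Pairwise correlations among the temperature columns only.
temp_cols <- grep("MT", names(df_demo), value = TRUE)
cor_temp <- cor(df_demo[, temp_cols])
print(round(cor_temp, 2))
```

Correlations close to 1 among the temperature columns would suggest that their individual coefficients are poorly determined, whatever their size.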

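# Return the coefficients from the last step of a lars fit whose
# absolute value exceeds `threshold`.
signif.coefs <- function(lasso, threshold = 1) {
  coefs <- coef(lasso)               # matrix: one row per step of the path
  last <- coefs[nrow(coefs), ]       # last row = final (least penalized) step
  signif <- which(abs(last) > threshold)
  # Names are the column indices of the selected variables
  setNames(last[signif], signif)
}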
signif.coefs(lasso)
     4        45 
 4.855257 -3.020055

lm(response ~ SP.MTMEAN + YEAR, data=df, na.action=na.pass) ###Terrible Lasso Chosen Model
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.16710    0.07190  -2.324  0.02022 *  
SP.MTMEAN    0.09889    0.02313   4.275 2.01e-05 ***
YEAR         0.14097    0.04580   3.078  0.00211 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9903 on 1915 degrees of freedom
Multiple R-squared:  0.01678,   Adjusted R-squared:  0.01576 
F-statistic: 16.34 on 2 and 1915 DF,  p-value: 9.167e-08

###variables chosen by me with model output from same data frame as above
lm(response~log1p.PTL_RESULT+log1p.NTL_RESULT+log1p.PH_RESULT+log1p.EPI.T+SU.MPPT, data=df, na.action=na.pass) 
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.01200    0.01972   0.608  0.54301    
log1p.PTL_RESULT  0.20672    0.03104   6.660 3.58e-11 ***
log1p.NTL_RESULT  0.21219    0.03335   6.362 2.49e-10 ***
log1p.PH_RESULT   0.15543    0.02543   6.113 1.18e-09 ***
log1p.EPI.T       0.09869    0.02189   4.508 6.93e-06 ***
SU.MPPT          -0.06002    0.02135  -2.811  0.00499 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8596 on 1912 degrees of freedom
Multiple R-squared:  0.2603,    Adjusted R-squared:  0.2583 
F-statistic: 134.5 on 5 and 1912 DF,  p-value: < 2.2e-16
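Note that `signif.coefs` reads the last row of `coef(lasso)`, which is the final, least penalized step of the path. When n > p and the full path is computed, that step is expected to coincide with the ordinary least-squares fit on all predictors, so it is not a sparse lasso solution. A minimal sketch of an alternative, assuming the lars package is installed, picks a point on the path by cross-validation with `cv.lars` and reads the coefficients there. The data, the names `x`, `y`, `fit`, and the choice of K = 10 are illustrative assumptions, not the original analysis.

```r
library(lars)

# Simulated data: only the first two of ten predictors matter.
set.seed(1)
n <- 200
p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
y <- 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n)

fit <- lars(x, y, type = "lasso")

# Compare the last row of the path with the OLS slopes.
path <- coef(fit)
ols <- coef(lm(y ~ x))[-1]
print(max(abs(path[nrow(path), ] - ols)))

# Choose the fraction of the final L1 norm by 10-fold cross-validation.
cv <- cv.lars(x, y, K = 10, type = "lasso", mode = "fraction", plot.it = FALSE)
s_best <- cv$index[which.min(cv$cv)]

# Coefficients at the cross-validated point on the path.
beta <- coef(fit, s = s_best, mode = "fraction")
print(beta[beta != 0])
```

Thresholding the coefficients at the cross-validated fraction, rather than at the last step, keeps the selection tied to the penalty that predicts best out of sample.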
atosbar
  • Hmm, that's really odd. Which package did you use, and where is signif.coefs from? – StupidWolf Feb 13 '20 at 18:34
  • It's using the lars package v1.2. Apologies, signif.coefs is a function that returns the coefficients exceeding a set threshold. I've included its code in the code section of the question. – atosbar Feb 13 '20 at 19:54
