I am running lasso regression on a large data set (n = 1918, p = 85). The variables that the regression identifies as important, when actually put into a linear model, explain almost none of the variance (R-squared around 0.017). On the other end, lasso gives near-zero coefficients to variables that are highly significant in a linear model, and does not select them. The data frame going into LARS is already scaled. Any ideas on why this might occur? Below is an example of what LARS might choose, along with a model I built from variables with real explanatory power, using the exact same data set.
UPDATE: I'm noticing that lasso is choosing all of my temperature variables and assigning them relatively large coefficients (absolute value > 1), while the coefficients of all the other variables fall between 0 and 1. I'm not sure why this is occurring.
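# Return the coefficients from the last row of coef(lasso) (the final step
# of the LARS path) whose absolute value exceeds `threshold`.
signif.coefs <- function(lasso, threshold = 1) {
  coefs <- coef(lasso)                     # one row per step of the path
  last <- coefs[nrow(coefs), ]             # final step only
  signif <- which(abs(last) > threshold)   # column indices that pass
  return(setNames(last[signif], signif))   # named by column index
}
signif.coefs(lasso)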
4 45
4.855257 -3.020055
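For anyone trying to reproduce this: `signif.coefs` reads only the last row of `coef(lasso)`, which is the final, least-penalized step of the path. The sketch below shows one way to put that row next to the coefficients at a point chosen by cross-validation. It assumes the `lars` package, and it uses simulated data: `x`, `y`, `fit`, and the true coefficients are hypothetical stand-ins, not the data from this question.

```r
# Sketch only: simulated data stands in for the real data frame.
library(lars)

set.seed(1)
n <- 200
p <- 10
x <- scale(matrix(rnorm(n * p), n, p))     # hypothetical scaled predictors
y <- x[, 1] - 0.5 * x[, 2] + rnorm(n)      # hypothetical response

fit <- lars(x, y, type = "lasso")

# Last row of the coefficient path: the final (least-penalized) step.
coefs <- coef(fit)
last.step <- coefs[nrow(coefs), ]

# Pick a point on the path by 10-fold cross-validation instead.
cv <- cv.lars(x, y, K = 10, type = "lasso", mode = "fraction", plot.it = FALSE)
s.best <- cv$index[which.min(cv$cv)]

# Coefficients at the cross-validated fraction of the maximal L1 norm.
cv.coefs <- coef(fit, s = s.best, mode = "fraction")

# Side-by-side comparison of the two sets of coefficients.
print(round(cbind(last.step, cv.coefs), 3))
```

If the two columns differ a lot on the real data, that would suggest the variables picked out by the threshold reflect the end of the path rather than a sparse lasso solution.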
# Terrible model built from the lasso-chosen variables
summary(lm(response ~ SP.MTMEAN + YEAR, data = df, na.action = na.pass))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.16710    0.07190  -2.324  0.02022 *
SP.MTMEAN    0.09889    0.02313   4.275 2.01e-05 ***
YEAR         0.14097    0.04580   3.078  0.00211 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9903 on 1915 degrees of freedom
Multiple R-squared: 0.01678, Adjusted R-squared: 0.01576
F-statistic: 16.34 on 2 and 1915 DF, p-value: 9.167e-08
# Variables chosen by me, fit on the same data frame as above
summary(lm(response ~ log1p.PTL_RESULT + log1p.NTL_RESULT + log1p.PH_RESULT + log1p.EPI.T + SU.MPPT, data = df, na.action = na.pass))
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       0.01200    0.01972   0.608  0.54301
log1p.PTL_RESULT  0.20672    0.03104   6.660 3.58e-11 ***
log1p.NTL_RESULT  0.21219    0.03335   6.362 2.49e-10 ***
log1p.PH_RESULT   0.15543    0.02543   6.113 1.18e-09 ***
log1p.EPI.T       0.09869    0.02189   4.508 6.93e-06 ***
SU.MPPT          -0.06002    0.02135  -2.811  0.00499 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8596 on 1912 degrees of freedom
Multiple R-squared: 0.2603, Adjusted R-squared: 0.2583
F-statistic: 134.5 on 5 and 1912 DF, p-value: < 2.2e-16