
I have several models whose choices of important predictors I would like to compare over the same data set, Lasso being one of them. The data set I am using consists of census data with around a thousand variables that have been renamed to "x1", "x2", and so on for convenience's sake (the original names are extremely long). I would like to report the top features and then rename those variables with shorter, more concise names.

My attempt to solve this is to extract the top variables from each iterated model, put them into a list, and then find the mean of the top variables over X loops. However, the top 10 most-used predictors still vary from run to run, so I cannot manually rename the variables, because each run of the code chunk yields different results. I suspect this is because I have so many variables in the analysis and because the resampling creates a new model on every bootstrap iteration.

For the sake of a simple example I used mtcars and will look for the top 3 most common predictors, since this data set has only 10 predictors.

library(glmnet)

data("mtcars") # Base R Dataset
df <- mtcars


topvar <- list()

for (i in 1:100) {
  
  # CV and Splitting
  
  ind <- sample(nrow(df), nrow(df), replace = TRUE)
  ind <- unique(ind)
  
  train <- df[ind, ]
  xtrain <- model.matrix(mpg~., train)[,-1]
  ytrain <- df[ind, 1]
  
  test <- df[-ind, ]
  xtest <- model.matrix(mpg~., test)[,-1]
  ytest <- df[-ind, 1]
  
  # Create Model per Loop
 
  model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2) 
                     
  # Store coefficients per loop
  
  coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
  
  # Store all nonzero Coefficients
  
  topvar[[i]] <- coef_las[which(coef_las != 0)]
  
}

# Unlist 

varimp <- unlist(topvar)

# Count all predictors

novar <- table(names(varimp))

# Find the mean of all variables

meanvar <- tapply(varimp, names(varimp), mean)

# Return top 3 repeated Coefs

repvar <- novar[order(novar, decreasing = TRUE)][1:3]

# Return mean of repeated Coefs

repvar.mean <- meanvar[names(repvar)]

repvar

Now, if you rerun the code chunk above, you will notice that the top 3 variables change, so renaming them would be difficult when they are not constant from run to run. Any suggestions on how I could approach this?

jad
  • There are implementations of bagged models working in parsnip; switch to those, append dalex at the end, and use feature importance based on permutations – Bruno Mar 22 '22 at 02:31
  • @Bruno So is your suggestion to use Tidymodels and Dalex packages and have similar code structure to avoid any nuances of specific model packages? Would this address the issue of the top features changing for the various models every time I run the code? – jad Mar 22 '22 at 03:15
  • You don't need to know what the n best predictors were in each bagging sample; I would take the average. If this is a stats-related matter, I suggest moving it to Cross Validated – Bruno Mar 22 '22 at 16:29
  • @Bruno I am taking the average though, see the object titled repvar.mean in my example. That is the average coefficient value for the top repeated predictors in X amount of loops. Or is this not what you are referring to as the average? – jad Mar 22 '22 at 18:03
  • I mean you take the average of all coefficients across all runs and then you filter the top 3 – Bruno Mar 22 '22 at 22:29
  • @Bruno Do you mean to run my loop, then run my coef_las code outside of the loop? Doing so captures only the last model created. So, for the example above, running the loop for (i in 1:100) and then running coef(model, s = 0.2)[-1, ] captures only model 100's coefficients – jad Mar 22 '22 at 22:46
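For what it's worth, here is a minimal sketch of one way to read Bruno's suggestion above (an interpretation, not code from either poster): keep the full coefficient vector from every run, zeros included, average each predictor's coefficient across runs, and only then pick the top 3 by mean absolute value. It reuses the resampling scheme from the question; the 100 iterations and lambda = 0.2 are carried over as-is.

library(glmnet)

data("mtcars")
df <- mtcars

# One row per run, one column per predictor, holding the full coefficient vector
coefmat <- matrix(NA_real_, nrow = 100, ncol = ncol(df) - 1,
                  dimnames = list(NULL, colnames(df)[-1]))

for (i in 1:100) {
  ind <- unique(sample(nrow(df), nrow(df), replace = TRUE))
  xtrain <- model.matrix(mpg ~ ., df[ind, ])[, -1]
  ytrain <- df[ind, 1]
  model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
  cf <- as.matrix(coef(model))[-1, 1]   # named coefficient vector, zeros included
  coefmat[i, names(cf)] <- cf
}

# Average every coefficient over all runs, then take the top 3 by mean absolute value
meancoef <- colMeans(coefmat)
sort(abs(meancoef), decreasing = TRUE)[1:3]

Because the averaging still happens over a fresh set of 100 resamples, the ranking can still shift slightly between runs unless the sampling itself is fixed, which is what the answer below addresses.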

1 Answer


You can use the function set.seed() to ensure your sample will return the same sample each time. For example:

set.seed(123)

When I add this to the above code and then run it twice, the following is returned both times:

  wt carb   hp 
  98   89   86
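As a minimal, self-contained illustration of why this works (the seed value 123 is just an example): resetting the seed before sampling reproduces the same bootstrap indices, so calling set.seed() once before the question's for loop makes the whole sequence of 100 resamples, and therefore the tabulated top predictors, identical across runs.

# Same seed, same kind of sampling as in the question's loop
set.seed(123)
ind1 <- unique(sample(nrow(mtcars), nrow(mtcars), replace = TRUE))

set.seed(123)
ind2 <- unique(sample(nrow(mtcars), nrow(mtcars), replace = TRUE))

identical(ind1, ind2)   # TRUE: the same bootstrap indices are drawn each time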
Brian Syzdek
  • I am aware of set.seed(). However, would this be plausible when trying to discover the top predictors in a model? If I am fixing the sample for every run, am I not messing with the fundamentals of CV and train/test splitting? – jad Mar 22 '22 at 03:48