
I have a logistic regression model I've created in tidymodels (R). I'm trying to do feature selection. How can I do feature selection in the tidymodels framework using packages published on CRAN (no development packages, please)?

Everyone just says to do regularized logistic regression, but I need to be able to do inference/have parameter confidence intervals, which regularized estimates don't provide.

Aegis
  • [Feature Engineering with recipes](https://www.tmwr.org/recipes.html) is very good. – Isaiah Feb 28 '23 at 21:30
  • It doesn't have feature selection info relevant to the above, however. Thanks for sharing. – Aegis Feb 28 '23 at 21:33
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Feb 28 '23 at 22:01

1 Answer


We (the tidymodels group) are working on more supervised filtering methods later in 2023. In the meantime, the recipeselectors package is a great tool to use.
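To make this concrete, here is a minimal sketch of using a supervised filter step from recipeselectors inside a workflow. This assumes the package's `step_select_roc()` interface (the `outcome` and `top_p` arguments) and the `two_class_dat` example data from modeldata; check the package documentation, since argument names may differ across versions.

```r
library(tidymodels)
library(recipeselectors)

data(two_class_dat, package = "modeldata")

# Filter step: keep the 5 predictors with the best single-variable
# ROC AUC against the outcome, then fit a plain logistic regression.
rec <- recipe(Class ~ ., data = two_class_dat) |>
  step_select_roc(all_predictors(), outcome = "Class", top_p = 5)

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(logistic_reg())

fit(wf, data = two_class_dat)
```

Because the filter lives in the recipe, it is re-run inside each resample when you use `fit_resamples()` or `tune_grid()`, so the selection is not leaked across folds.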

One thing, though: the standard errors and p-values are most likely not valid if you have searched through a large number of models. The results would be, to some unknown extent, overly optimistic.

You could bootstrap the selection process a large number of times and estimate confidence intervals for the parameters. A big potential issue is that those estimates are probably bi-modal with some percentage of models having a lot of zero values (when they were not selected).
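A sketch of that bootstrap idea, assuming a hypothetical `select_and_fit()` placeholder where your own selection procedure would go (here it just fits all terms), `two_class_dat` from modeldata, and percentile intervals computed by hand:

```r
library(tidymodels)

data(two_class_dat, package = "modeldata")

# Placeholder: run your feature-selection procedure on the analysis
# set, then fit logistic regression on the selected terms. Terms that
# are dropped in a given resample simply don't appear in its output,
# which is what produces the bi-modal / point-mass-at-zero behavior.
select_and_fit <- function(split) {
  d <- analysis(split)
  glm(Class ~ ., data = d, family = binomial) |>
    broom::tidy()
}

set.seed(1)
boots <- bootstraps(two_class_dat, times = 500)
coefs <- purrr::map_dfr(boots$splits, select_and_fit)

# Percentile intervals per coefficient across bootstrap resamples
coefs |>
  dplyr::group_by(term) |>
  dplyr::summarise(
    lower = quantile(estimate, 0.025),
    upper = quantile(estimate, 0.975)
  )
```

Note that when a term is selected in only some resamples, interpreting its interval requires care; reporting the selection frequency alongside the interval is one common workaround.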

I think that one of the cleanest approaches is to use a Bayesian spike-and-slab model. You can get excellent inferences from it. It may be computationally expensive, but so are wrapper methods for feature selection.
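One CRAN option for this is BoomSpikeSlab. The sketch below assumes its `logit.spike()` interface and a hypothetical data frame `my_data` with a binary outcome `Class`; see the package manual for the exact arguments and plot types.

```r
library(BoomSpikeSlab)

# Spike-and-slab logistic regression: the prior puts positive mass
# on each coefficient being exactly zero, so selection and inference
# happen jointly in one model.
fit <- logit.spike(Class ~ ., data = my_data, niter = 5000)

# Posterior inclusion probabilities show which predictors the model
# keeps; the coefficient draws give credible intervals directly.
summary(fit)
plot(fit, "inclusion")
```

The appeal here is that the uncertainty from the selection step is baked into the posterior, rather than being ignored as it is with post-selection standard errors.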

topepo
  • I see caret has support for simulated annealing and genetic algorithms. And they take recipes, cool beans. I'll also look at the spike and slab and recipeselectors. Thanks for building these superb packages. Would standard errors and p-values be optimistic even if the models are evaluated using cross validation and an out-of-sample metric (accuracy, for example)? – Aegis Mar 01 '23 at 14:48