
I'm trying to fit a model with every combination of the variables in my data, to find the best-performing one.

There are a lot of similar questions (link, link, link, link, link, to list the most interacted-with ones). Most of them receive one of two solutions: mapping over the combinations with an lm call (by either subsetting the data or building a formula), or dredging with MuMIn::dredge.

Question: are there other options? Which of them is generally the fastest one (including the two above)?

The gist of the question is all above, but to make helping easier, below are a reproducible example, the attempts I made, and a benchmark of them. I tried using tidymodels, but my first attempt was too slow and the second (using workflowsets) didn't work. Any help with those is appreciated. Thank you!

Dummy data:

library(tidyverse)
library(tidymodels)

data = mtcars[,1:5] # dummy data
preds = names(data)[-1] # predictor names
# all non-empty combinations of the predictors, as character vectors
preds.combs = map(seq_along(preds), ~ combn(preds, .x, simplify = FALSE)) %>% flatten()

My attempts:

First option: map over every combination with a lm call.

test_map = function(){
  map(preds.combs, ~ lm(reformulate(.x, "mpg"), data))}

Building a formula seems to be faster because R doesn't need to alter any objects, whereas with ~ lm(mpg ~ ., select(data, mpg, all_of(.x))), it has to build a new data frame on every iteration.
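A quick check (base R only, my own sketch) that the formula-building approach and the data-subsetting approach fit the same models:

```r
# Rebuild the combinations without tidyverse, then fit both ways
data <- mtcars[, 1:5]
preds <- names(data)[-1]
combos <- unlist(lapply(seq_along(preds),
                        function(k) combn(preds, k, simplify = FALSE)),
                 recursive = FALSE)

fits_formula <- lapply(combos, function(x) lm(reformulate(x, "mpg"), data))
fits_subset  <- lapply(combos, function(x) lm(mpg ~ ., data[, c("mpg", x)]))

length(combos)  # 15 combinations for 4 predictors (2^4 - 1)
all.equal(coef(fits_formula[[15]]), coef(fits_subset[[15]]))  # TRUE
```

The formula version never copies the data, which is where the speed difference comes from.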

Second option: dredging with MuMIn::dredge.

test_dredge = function(){
  lm(mpg ~ ., data, na.action = "na.fail") %>% MuMIn::dredge()}

I would like to avoid this method because it doesn't return the full model objects, and it only works for the model types supported by the package.
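That said, if you do end up with a dredge table, MuMIn::get.models() can recover the full model objects by refitting each row (still limited to model types the package supports). A sketch:

```r
library(MuMIn)

# Fit the global model, then dredge all subsets
fit_all <- lm(mpg ~ ., mtcars[, 1:5], na.action = "na.fail")
dd <- MuMIn::dredge(fit_all)

# Refit every model in the dredge table and return the lm objects
models <- MuMIn::get.models(dd, subset = TRUE)
```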

Third option: create a workflow using tidymodels, and update it for every combination.

test_wflow = function(){
  wflow = workflow() %>% add_model(linear_reg()) %>% add_variables(mpg, cyl)
  res = list()
  
  for(i in seq_along(preds.combs)){
    res[[i]] = wflow %>%
      update_variables(mpg, all_of(preds.combs[[i]])) %>%
      fit(data)}
  res} #return the fitted workflows

Changing the update method to update_formula or update_recipe didn't significantly change the runtime.

Fourth option: create a workflow set using workflowsets, and fit all the workflows afterwards.

library(workflowsets)
test_wflowset = function(){
  wflow = workflow_set(
    preproc = map(preds.combs, ~ reformulate(.x, "mpg")),
    models = list(lm = linear_reg()), cross = TRUE)
  
  fit(wflow)} #this is wrong. how do i fit each workflow in a wflow set?

There is an option to "tune" the parameters of each workflow together with cross-validation, workflow_map(wflow, resamples = vfold_cv(data, v = 10)), but I don't want to tune the parameters, I just want the fits. As this one isn't working, it's not included in the benchmark.
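One workaround I can think of (not an official fitting method for workflow sets, as far as I know) is to pull each workflow out of the set by its id with extract_workflow() and fit it individually:

```r
library(tidymodels)
library(workflowsets)

data <- mtcars[, 1:5]
preds <- names(data)[-1]
combos <- unlist(lapply(seq_along(preds),
                        function(k) combn(preds, k, simplify = FALSE)),
                 recursive = FALSE)

# Named preprocessors so each workflow gets a readable id
wset <- workflow_set(
  preproc = setNames(lapply(combos, function(x) reformulate(x, "mpg")),
                     paste0("f", seq_along(combos))),
  models  = list(lm = linear_reg())
)

# Fit each member workflow to the single dataset, one id at a time
fits <- lapply(wset$wflow_id,
               function(id) fit(extract_workflow(wset, id = id), data))
```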

Benchmark:

bench::mark(test_map(), test_dredge(), test_wflow(),
            check = FALSE, min_iterations = 3) %>%
  select(c(expression, min, median, mem_alloc, n_itr, n_gc))
    
  expression         min   median mem_alloc n_itr  n_gc
  <bch:expr>    <bch:tm> <bch:tm> <bch:byt> <int> <dbl>
1 test_map()      20.4ms  21.44ms    99.2KB    21     1
2 test_dredge()  46.75ms   48.9ms   110.9KB    11     0
3 test_wflow()     2.12s    2.13s       5MB     3     5

1 Answer


For benchmarking like this, it will be better to use a dataset with a more realistic size. With a tiny dataset like mtcars, the benchmarking is dominated by overhead of getting in/out of fitting. Also, you might find leave_var_out_formulas() helpful for your benchmarking:

library(tidymodels)
data(mlc_churn, package = "modeldata")

set.seed(1)
mlc_split <- initial_split(mlc_churn, strata = churn)
mlc_train <- training(mlc_split)
mlc_test <- testing(mlc_split)

formulas <- leave_var_out_formulas(churn ~ ., data = mlc_train)

bench::mark(
    map(formulas, ~ glm(.x, mlc_train, family = "binomial")),
    map(formulas, ~ workflow(.x, logistic_reg()) %>% fit(mlc_train)),
    iterations = 5, check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                                                          min median
#>   <bch:expr>                                                      <bch:t> <bch:>
#> 1 map(formulas, ~glm(.x, mlc_train, family = "binomial"))           2.13s  2.23s
#> 2 map(formulas, ~workflow(.x, logistic_reg()) %>% fit(mlc_train))   2.85s  2.97s
#> # … with 3 more variables: `itr/sec` <dbl>, mem_alloc <bch:byt>, `gc/sec` <dbl>

Created on 2022-12-01 with reprex v2.0.2

By design, there is no method supported for fitting a workflow set to a single, unresampled dataset. This is to provide protection against overfitting; in a real ML problem, you would only ever want to fit one final model (chosen via resampling) to your training dataset.
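If the goal is to compare all the candidate formulas, the supported route is resampling; roughly, a sketch using workflow_map() with "fit_resamples" (function names from workflowsets/tune as I understand them, with the same data setup as above):

```r
library(tidymodels)
data(mlc_churn, package = "modeldata")

set.seed(1)
mlc_train <- training(initial_split(mlc_churn, strata = churn))

formulas <- leave_var_out_formulas(churn ~ ., data = mlc_train)

wset <- workflow_set(preproc = formulas,
                     models  = list(lr = logistic_reg()))

# Resample every candidate workflow instead of fitting the set once
res <- workflow_map(wset, "fit_resamples",
                    resamples = vfold_cv(mlc_train, v = 10, strata = churn))
rank_results(res)
```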

Julia Silge
  • Hi Julia, is there a way to include programmatically all possible predictor combinations? For instance, in your example: churn ~ state, churn ~ area_core, churn ~ international_plan, churn~ state + area_core, churn ~ state + international_plan, and so on. Thank you. – BqKaXeu29y Aug 30 '23 at 15:51
  • Hmmm, I don't know of a ready-made way to go, but you might [check out the code for `leave_var_out_formulas()`](https://github.com/tidymodels/workflowsets/blob/main/R/leave_var_out_formulas.R) to see a possible approach. – Julia Silge Aug 30 '23 at 17:25