I'm trying to fit a model with every combination of the variables in my data, and I want to find the fastest way to do so.
There are a lot of similar questions (link, link, link, link, link, to list the most interacted-with ones). Most of them get two solutions: mapping over the combinations with an lm call (either subsetting the data or building a formula), or dredging with MuMIn::dredge.
Question: are there other options? Which one is generally the fastest (including the two above)?
The gist of the question is all above, but to make answering easier, below are a reproducible example, the attempts I made, and a benchmark of them. I tried using tidymodels, but my first attempt was too slow, and the second (using workflowsets) didn't work. Any help with those is appreciated. Thank you!
Dummy data:
library(tidyverse)
library(tidymodels)

data = mtcars[, 1:5] # dummy data
preds = names(data)[-1] # predictor names
preds.combs = map(seq_along(preds), ~ combn(preds, .x, simplify = FALSE)) %>%
  flatten() # list of character vectors, one per predictor combination
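For this data, preds.combs has 15 elements, one per non-empty subset of the 4 predictors:

length(preds.combs) # 15 = choose(4,1) + choose(4,2) + choose(4,3) + choose(4,4)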
My attempts:
First option: map over every combination with an lm call.
test_map = function(){
  map(preds.combs, ~ lm(reformulate(.x, "mpg"), data))
}
Building a formula seems to be faster because R doesn't need to create any new objects, whereas with lm(mpg ~ ., select(data, mpg, all_of(.x))) it has to build a new data frame on every iteration.
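For reference, a minimal sketch of that subsetting variant (it fits the same models, just through a fresh data frame each time):

test_map_subset = function(){
  map(preds.combs, ~ lm(mpg ~ ., select(data, mpg, all_of(.x))))
}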
Second option: dredging with MuMIn::dredge.
test_dredge = function(){
  lm(mpg ~ ., data, na.action = "na.fail") %>% MuMIn::dredge()
}
I would like to avoid this method because it doesn't return the full model objects, and it only works for the model types supported by the package, not for arbitrary modelling.
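If refitting afterwards were acceptable, I believe MuMIn::get.models can rebuild the full model objects from the dredge table (an untested sketch, not benchmarked):

dd = lm(mpg ~ ., data, na.action = "na.fail") %>% MuMIn::dredge()
fits = MuMIn::get.models(dd, subset = TRUE) # refits every model in the table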
Third option: create a workflow using tidymodels, and update it for every combination.
test_wflow = function(){
  wflow = workflow() %>% add_model(linear_reg()) %>% add_variables(mpg, cyl)
  res = list()
  for(i in seq_along(preds.combs)){
    res[[i]] = wflow %>%
      update_variables(mpg, all_of(preds.combs[[i]])) %>%
      fit(data)
  }
  res # return the list of fitted workflows
}
Changing the update method to update_formula or update_recipe didn't significantly change the runtime.
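Roughly, the update_formula variant was a sketch like this (the workflow has to start from add_formula so there is a formula to update):

test_wflow_formula = function(){
  wflow = workflow() %>% add_model(linear_reg()) %>% add_formula(mpg ~ cyl)
  map(preds.combs, ~ wflow %>%
        update_formula(reformulate(.x, "mpg")) %>%
        fit(data))
}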
Fourth option: create a workflow set using workflowsets, and fit all the workflows afterwards.
library(workflowsets)
test_wflowset = function(){
  wflow = workflow_set(
    preproc = map(preds.combs, ~ reformulate(.x, "mpg")),
    models = list(lm = linear_reg()), cross = TRUE)
  fit(wflow) # this is wrong. How do I fit each workflow in a workflow set?
}
There is an option to "tune" the parameters of each workflow together with a cross-validation, workflow_map(wflow, resamples = vfold_cv(data, v = 10)), but I don't want to tune the parameters, I just want the fits. As this option isn't working, it's not included in the benchmark.
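The closest workaround I can think of is extracting each workflow by its id and fitting it one by one, which seems to defeat the purpose of the set (an untested sketch, also not benchmarked; I'm not sure this is the intended use of extract_workflow):

test_wflowset_manual = function(){
  wflow = workflow_set(
    preproc = map(preds.combs, ~ reformulate(.x, "mpg")),
    models = list(lm = linear_reg()), cross = TRUE)
  map(wflow$wflow_id, ~ extract_workflow(wflow, id = .x) %>% fit(data))
}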
Benchmark:
bench::mark(test_map(), test_dredge(), test_wflow(),
            check = FALSE, min_iterations = 3) %>%
  select(c(expression, min, median, mem_alloc, n_itr, n_gc))
  expression        min   median mem_alloc n_itr  n_gc
  <bch:expr>   <bch:tm> <bch:tm> <bch:byt> <int> <dbl>
1 test_map()     20.4ms  21.44ms    99.2KB    21     1
2 test_dredge() 46.75ms   48.9ms   110.9KB    11     0
3 test_wflow()    2.12s    2.13s       5MB     3     5