Using dataframe name as a column in a model table

Question

I'm confused as to why the following doesn't work. I'm trying to use the name of a data frame/tibble as a column in a multiple-models data frame, but I keep running into the error below. Here's an example:

library(tidyverse)
library(rlang)

set.seed(666)
df1 <- tibble(
  x = 1:10 + rnorm(10),
  y = seq(20, 38, by=2) + rnorm(10),
  z = 2*x + 3*y
)

df2 <- tibble(
  x = 1:10 + rnorm(10),
  y = seq(20, 38, by=2) + rnorm(10),
  z = 4*x + 5*y
)

results <- tibble(dataset = c('df1','df2'))

Notice that the following all work:

lm(z ~ x + y, data=df1)
lm(z ~ x + y, data=df2)
lm(z ~ x + y, data=eval(sym('df1')))

But when I try the following:

results <- results %>% mutate(model = lm(z ~ x + y, data = eval(sym(dataset))))

I get the error

Error in mutate_impl(.data, dots) : 
  Evaluation error: Only strings can be converted to symbols.

Can someone figure out how to make this work?

The preferred approach for this sort of thing is to have `df1` and `df2` in a single data frame, with a column delineating the two groups, and then fit the model by group explicitly. — joran, May 16 '18 at 21:02
Yes, I'm aware of that approach, but in reality the data frames are quite large and so manipulating them as a single data frame or as data frame entries in a list column is unwieldy. — David Pepper, May 16 '18 at 21:16

score 2 · Answer 1 · answered May 17 '18 at 00:55

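Inside `mutate()`, `dataset` is the whole column (a character vector of length 2), but `sym()` accepts only a single string, hence the error. Use `map()` to call `lm()` once per dataset name and collect the fitted models in a list column: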

library(tidyverse)
library(rlang)

results2 <- results %>% 
  mutate(model = map(dataset, ~lm(z ~ x + y, data = eval(sym(.)))))

results2
# # A tibble: 2 x 2
#   dataset model   
#   <chr>   <list>  
# 1 df1     <S3: lm>
# 2 df2     <S3: lm>

results2$model[[1]]
# Call:
#   lm(formula = z ~ x + y, data = eval(sym(.)))
# 
# Coefficients:
# (Intercept)            x            y  
#   6.741e-14    2.000e+00    3.000e+00

results2$model[[2]]
# Call:
#   lm(formula = z ~ x + y, data = eval(sym(.)))
# 
# Coefficients:
# (Intercept)            x            y  
#   9.662e-14    4.000e+00    5.000e+00
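If the `eval(sym(.))` step feels opaque, here is a minimal sketch of the same idea using base R's `get()` and an explicit anonymous function. It assumes `df1` and `df2` are defined in the environment the code runs in (as in the question); the names `results3` and `name` are only for illustration.

```r
library(tidyverse)

# Same toy data as in the question
set.seed(666)
df1 <- tibble(
  x = 1:10 + rnorm(10),
  y = seq(20, 38, by = 2) + rnorm(10),
  z = 2 * x + 3 * y
)
df2 <- tibble(
  x = 1:10 + rnorm(10),
  y = seq(20, 38, by = 2) + rnorm(10),
  z = 4 * x + 5 * y
)

results <- tibble(dataset = c("df1", "df2"))

# map() hands one dataset name at a time to the function;
# get() looks up the object with that name (assumed to exist).
results3 <- results %>%
  mutate(model = map(dataset, function(name) lm(z ~ x + y, data = get(name))))

# Coefficients of the first model
coef(results3$model[[1]])
```

This keeps the large data frames where they are and only stores the fitted models in the table, which seems to be what the question is after.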

Thanks very much for the help. Now that I see the answer, I have to figure out why the answer works -- specifically, the meaning of the extra `~` in your code. It's a bit like the "42" from Hitchhiker's Guide. — David Pepper, May 18 '18 at 15:51

score 1 · Accepted Answer · answered Aug 29 '19 at 13:12

I'd recommend a slightly different route where you bind all the data and skip the eval and sym calls. This follows the "Many Models" chapter of R for Data Science.

`tibble::lst()` (attached with the tidyverse) creates a list of the data frames, using the names of those variables as the list's names. The `.id` argument to `bind_rows()` then turns those names into a column marking each row as coming from `df1` or `df2`. Nesting creates a column `data`, which is a list-column of data frames, one per dataset. Then you can fit a model to each dataset. I used the tilde shortcut notation to build the anonymous function.

The result is a `model` column that is a list of fitted models.

library(tidyverse)

results <- lst(df1, df2) %>%
  bind_rows(.id = "dataset") %>%
  group_by(dataset) %>%
  nest() %>%
  mutate(model = map(data, ~lm(z ~ x + y, data = .)))

results$model[[1]]
#> 
#> Call:
#> lm(formula = z ~ x + y, data = .)
#> 
#> Coefficients:
#> (Intercept)            x            y  
#>   6.741e-14    2.000e+00    3.000e+00

The nested `data` column is still there. If you don't need it, you can drop it:

select(results, -data)
#> # A tibble: 2 x 2
#>   dataset model 
#>   <chr>   <list>
#> 1 df1     <lm>  
#> 2 df2     <lm>
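On the tilde shortcut mentioned above: the outer `~` is purrr's shorthand for an anonymous function, where `.` (or `.x`) stands for the current element. The inner `~` in `z ~ x + y` is an ordinary model formula and is unrelated. A small sketch comparing the two spellings, assuming the same `df1` and `df2` as in the question (the names `nested`, `with_tilde`, `with_function`, and `d` are only for illustration):

```r
library(tidyverse)

# Same toy data as in the question
set.seed(666)
df1 <- tibble(
  x = 1:10 + rnorm(10),
  y = seq(20, 38, by = 2) + rnorm(10),
  z = 2 * x + 3 * y
)
df2 <- tibble(
  x = 1:10 + rnorm(10),
  y = seq(20, 38, by = 2) + rnorm(10),
  z = 4 * x + 5 * y
)

nested <- lst(df1, df2) %>%
  bind_rows(.id = "dataset") %>%
  group_by(dataset) %>%
  nest()

# Tilde shorthand: `.x` is the data frame for the current row
with_tilde <- map(nested$data, ~ lm(z ~ x + y, data = .x))

# The same thing written as an explicit anonymous function
with_function <- map(nested$data, function(d) lm(z ~ x + y, data = d))

# Both spellings should fit the same models
all.equal(coef(with_tilde[[1]]), coef(with_function[[1]]))
```

The explicit `function(d)` form is longer but may be easier to read while the shorthand is still unfamiliar.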
