3

I'm having trouble figuring out how to use purrr::map() with mutate(across(...)).

I want to do a linear model and pull out the estimate for the slope of multiple columns as predicted by a single column.

Here is what I'm attempting with an example data set:

mtcars %>%
  mutate(across(-mpg), 
    map(.x, lst(slope = ~lm(.x ~ mpg, data = .x) %>% 
          tidy() %>% 
          filter(term != "(Intercept") %>% 
            pull(estimate)
  )))

The output I'm looking for would be new columns for each non-mpg column with _slope appended to the name, ie cyl_slope

In my actual data, I'll be grouping by another variable as well in case that matters, as I need the slope for each group for each predicted variable. I have this working in a standard mutate doing one variable at a time as follows:

df %>% 
  group_by(unitid) %>% 
  nest() %>% 

  mutate(tuition_and_fees_as_pct_total_rev_slope = map_dbl(data, ~lm(tuition_and_fees_as_pct_total_rev ~ year, data = .x) %>%
               tidy() %>%
                 filter(term == "year") %>%
                 pull(estimate)
               ))

So:

  1. I think my issue is how to pass the column name being predicted into the lm
  2. I don't know if the solution requires nesting or not, so it would be appreciated if in the mtcars example that is considered.
jzadra
  • 4,012
  • 2
  • 26
  • 46
  • In the second case, do you have a reproducible example. The first one didn't work because 1) `across(-mpg)` closed, 2) the `data = .x` should be the original data because the `.x` is a vector – akrun May 25 '21 at 19:31
  • @akrun no I don't unfortunately, because I'm stuck on how to get it to work. – jzadra May 25 '21 at 19:35

2 Answers2

4

If we wanted to do lm on all other columns with independent variable as 'mpg', one option is to loop over the column names of the 'mtcars' except the 'mpg', create the formula with reformulate, apply the lm, convert to a tidy format, filter out the 'Intercept' and select the 'estimate' column

library(dplyr)
library(tidyr)
library(broom)
map_dfc(setdiff(names(mtcars), 'mpg'), ~ 
   lm(reformulate('mpg', response = .x), data = mtcars) %>%
     tidy %>% 
     filter(term != "(Intercept)") %>%
     select(estimate))

-output

# A tibble: 1 x 10
#   estimate...1 estimate...2 estimate...3 estimate...4 estimate...5 estimate...6 estimate...7 estimate...8 estimate...9 estimate...10
#      <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>         <dbl>
#1       -0.253        -17.4        -8.83       0.0604       -0.141        0.124       0.0555       0.0497       0.0588        -0.148

Or this can be done more easily with a matrix as dependent

library(stringr)
lm(as.matrix(mtcars[setdiff(names(mtcars), "mpg")]) ~ mpg, 
       data = mtcars) %>% 
    tidy %>% 
    filter(term != "(Intercept)") %>%
    select(response, estimate) %>%
    mutate(response = str_c(response, '_slope'))

-output

# A tibble: 10 x 2
#   response   estimate
#   <chr>         <dbl>
# 1 cyl_slope   -0.253 
# 2 disp_slope -17.4   
# 3 hp_slope    -8.83  
# 4 drat_slope   0.0604
# 5 wt_slope    -0.141 
# 6 qsec_slope   0.124 
# 7 vs_slope     0.0555
# 8 am_slope     0.0497
# 9 gear_slope   0.0588
#10 carb_slope  -0.148 

Or another option is summarise with across

mtcars %>%
     summarise(across(-mpg, ~ list(lm(reformulate('mpg', 
              response = cur_column())) %>%
                   tidy %>%
                   filter(term != "(Intercept)") %>%
                   pull(estimate)), .names = "{.col}_slope")) %>%
     unnest(everything())
# A tibble: 1 x 10
#  cyl_slope disp_slope hp_slope drat_slope wt_slope qsec_slope vs_slope am_slope gear_slope carb_slope
#      <dbl>      <dbl>    <dbl>      <dbl>    <dbl>      <dbl>    <dbl>    <dbl>      <dbl>      <dbl>
#1    -0.253      -17.4    -8.83     0.0604   -0.141      0.124   0.0555   0.0497     0.0588     -0.148
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks. In the actual data, it isn't all other columns which is why I am using mutate(across(...)) – jzadra May 25 '21 at 19:39
  • 1
    @jzadra you can subset to a matrix and use the same independent variable as in the update – akrun May 25 '21 at 19:40
  • 1
    @jzadra i updated with two more options. Hope it works fo ryou – akrun May 25 '21 at 19:45
  • So your summarize version works with mutate well, the only issue is that it isn't applying the new name (concatenating the original name with the named function) mtcars %>% mutate(across(-mpg, ~ list(slope = lm(reformulate('mpg', response = cur_column())) %>% tidy %>% filter(term != "(Intercept") %>% pull(estimate)))) – jzadra May 25 '21 at 20:48
  • I'm also puzzled by why I'm getting two values returned (IE why it's nested) - there shoudl only be a single value returned. – jzadra May 25 '21 at 20:49
  • 1
    @jzadra If you new names `.names = "{.col}_slope"))` – akrun May 25 '21 at 20:50
  • 1
    @jzadra that is easy to explain `filter(term != "(Intercept)")` and not `filter(term != "(Intercept")` – akrun May 25 '21 at 20:52
2

One option could be:

map_dfr(.x = names(select(mtcars, -c(mpg, vs))),
        ~ mtcars %>%
         group_by(vs) %>%
         nest() %>%
         mutate(variable = .x,
                estimate = map_dbl(data, function(y) lm(!!sym(.x) ~ mpg, data = y) %>% 
                                    tidy() %>%
                                    filter(term != "(Intercept)") %>%
                                    pull(estimate))) %>%
         select(-data))

     vs variable estimate
   <dbl> <chr>       <dbl>
 1     0 cyl       -0.242 
 2     1 cyl       -0.116 
 3     0 disp     -22.5   
 4     1 disp      -8.01  
 5     0 hp       -10.1   
 6     1 hp        -3.26  
 7     0 drat       0.0748
 8     1 drat       0.0529
 9     0 wt        -0.192 
10     1 wt        -0.113 
11     0 qsec      -0.0357
12     1 qsec      -0.0432
13     0 am         0.0742
14     1 am         0.0710
15     0 gear       0.114 
16     1 gear       0.0492
17     0 carb      -0.0883
18     1 carb      -0.0790
tmfmnk
  • 38,881
  • 4
  • 47
  • 67