-1

I have the following dataframe:

Index <- seq.int(1:10)
A <- c(5, 5, 3, 4, 3, 3, 2, 2, 4, 3)
B <- c(10, 11, 12, 12, 12, 11, 13, 13, 14, 13)
C <- c(7, 6, 7, 7, 6, 5, 6, 5, 5, 4)
df <- data.frame(Index, A, B, C)
> df
      Index A  B C
 [1,]     1 5 10 7
 [2,]     2 5 11 6
 [3,]     3 3 12 7
 [4,]     4 4 12 7
 [5,]     5 3 12 6
 [6,]     6 3 11 5
 [7,]     7 2 13 6
 [8,]     8 2 13 5
 [9,]     9 4 14 5
[10,]    10 3 13 4

I would like to generate linear models (and ultimately obtain slopes, intercepts, and coefficients of determination in an easy-to-work-with dataframe form) with the Index column as the dependent variable and with all of the other columns as the response variable, separately. I know I can do this by running the following line of code:

summary(lm(cbind(A, B, C) ~ Index, data = df))

One issue I have with the above line of code is that it uses the cbind function, and thus, I have to input each column separately. I am working with a large dataframe with many columns, and instead of using the cbind function, I'd love to be able to tell the function to use a bunch of columns (i.e., response variables) at once by writing something like df[, 2:ncol(df)] in place of cbind(A, B, C).

Another issue I have with the above line of code is that the output is not really in a user-friendly form. Ultimately, I would like the output (slopes, intercepts, and coefficients of determination) to be in an easy-to-work-with dataframe form:

response <- c("A", "B", "C")
slope <- c(-0.21818, 0.33333, -0.29091)
intercept <- c(4.60000, 10.26667, 7.40000)
r.squared <- c(0.3776, 0.7106, 0.7273)
summary_df <- data.frame(response, slope, intercept, r.squared)
> summary_df
  response    slope intercept r.squared
1        A -0.21818   4.60000    0.3776
2        B  0.33333  10.26667    0.7106
3        C -0.29091   7.40000    0.7273

What is the most efficient way to do this? There must be a solution using the lapply function that I'm just not getting. Thanks so much!

David Moore
  • 670
  • 3
  • 15
  • 3
    check out the `broom` package ... – Ben Bolker Oct 24 '18 at 02:54
  • @李哲源 - it might be worth adding the `as.matrix` option from my answer below to your excellent detailed answer on the previous duplicate - https://stackoverflow.com/a/39279431/496803 . I think it does the multiple LHS in the same way, and avoids fiddling with formulas manually. I'll delete my answer then and that will clean things up a bit. – thelatemail Oct 24 '18 at 03:42
  • @thelatemail Using matrix variable can mess things up when you use `.` in the formula. Try `lm(as.matrix(trees[1:2]) ~ ., trees)`, although we really want to do `lm(as.matrix(trees[1:2]) ~ Volume, trees)`. Mixing matrix variable and formula is not a good idea. Either use formula only, or use matrix only. – Zheyuan Li Oct 24 '18 at 04:02
  • I don't disagree - `ids <- 1:2; lm(as.matrix(df[ids]) ~ as.matrix(df[-ids]), data=df)` or similar would still be a handy addition I reckon. – thelatemail Oct 24 '18 at 04:25

2 Answers2

0

To address the first part of your query, you can pass matrix objects to lm formula sides:

summary(lm(as.matrix(df[-1]) ~ as.matrix(df[1])))

Checks out in terms of the reported coefficients:

all.equal(
  coef(lm(as.matrix(df[-1]) ~ as.matrix(df[1]))),
  coef(lm(cbind(A,B,C) ~ Index, data=df)),
  check.attributes=FALSE
)
#[1] TRUE

Note the warning from 李哲源 that combining this like matrix(...) ~ . will not work as intended. It may generally be safer to specify both sides as expressions, or both sides as a matrix only.

thelatemail
  • 91,185
  • 12
  • 128
  • 188
0

I would convert the data frame to a tibble. This allows you to use list columns as described in this presentation, to store and manipulate your models.

Let's call the data frame df1, not df. Convert to a tibble, then use tidyr::gather() and tidyr::nest to reshape it:

library(tidyverse)
library(broom)

df1 %>% 
  as.tibble() %>% 
  gather(Var, Val, -Index) %>% 
  nest(-Var)

The result is a tibble with a row for each of A, B, C and a column data which stores the Index column and the corresponding value, Val, for each of A, B, C.

# A tibble: 3 x 2
  Var   data             
  <chr> <list>           
1 A     <tibble [10 x 2]>
2 B     <tibble [10 x 2]>
3 C     <tibble [10 x 2]>

Now we can use dplyr::mutate() and purrr::map to create a column containing the models for each of A, B and C.

df1 %>% 
  as.tibble() %>% 
  gather(Var, Val, -Index) %>% 
  nest(-Var) %>% 
  mutate(model = map(data, ~lm(Index ~ Val, .)))

# A tibble: 3 x 3
  Var   data              model   
  <chr> <list>            <list>  
1 A     <tibble [10 x 2]> <S3: lm>
2 B     <tibble [10 x 2]> <S3: lm>
3 C     <tibble [10 x 2]> <S3: lm>

Finally we can use broom::glance() or broom::tidy() to extract the required values from the models, then tidyr::unnest() to get back to a regular tibble.

Using glance:

df1 %>% 
  as.tibble() %>% 
  gather(Var, Val, -Index) %>% 
  nest(-Var) %>% 
  mutate(model = map(data, ~lm(Index ~ Val, .)), 
         summary = map(model, glance)) %>% 
  unnest(summary) %>% 
  select(-data, -model)

# A tibble: 3 x 12
  Var   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC deviance df.residual
  <chr>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
1 A         0.378         0.300  2.53      4.85 0.0587      2  -22.4  50.7  51.6     51.3           8
2 B         0.711         0.674  1.73     19.6  0.00219     2  -18.5  43.1  44.0     23.9           8
3 C         0.727         0.693  1.68     21.3  0.00171     2  -18.2  42.5  43.4     22.5           8

Using tidy:

df1 %>% 
  as.tibble() %>% 
  gather(Var, Val, -Index) %>% 
  nest(-Var) %>% 
  mutate(model = map(data, ~lm(Index ~ Val, .)), 
         summary = map(model, tidy)) %>% 
  unnest(summary)

# A tibble: 6 x 6
  Var   term        estimate std.error statistic  p.value
  <chr> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 A     (Intercept)    11.4      2.79       4.08 0.00352 
2 A     Val            -1.73     0.786     -2.20 0.0587  
3 B     (Intercept)   -20.3      5.85      -3.47 0.00842 
4 B     Val             2.13     0.481      4.43 0.00219 
5 C     (Intercept)    20        3.18       6.28 0.000237
6 C     Val            -2.5      0.541     -4.62 0.00171
neilfws
  • 32,751
  • 5
  • 50
  • 63