1

I have 10 linear models where I only need some information, namely: r-squared, p-value, coefficients of slope and intercept. I managed to extract these values (via ridiculously repeating the code). Now, I need to tabulate these values (Info in the columns; the rows listing results from linear models 1-10). Can anyone please help me? I have hundreds more linear models to do. I'm sure there must be a way.

Data file hosted here

Code:

d<-read.csv("example.csv",header=T)

# Subset data
A3G1 <- subset(d, CatChro=="A3G1"); A4G1 <- subset(d, CatChro=="A4G1")
A3G2 <- subset(d, CatChro=="A3G2"); A4G2 <- subset(d, CatChro=="A4G2")
A3G3 <- subset(d, CatChro=="A3G3"); A4G3 <- subset(d, CatChro=="A4G3")
A3G4 <- subset(d, CatChro=="A3G4"); A4G4 <- subset(d, CatChro=="A4G4")
A3G5 <- subset(d, CatChro=="A3G5"); A4G5 <- subset(d, CatChro=="A4G5")
A3D1 <- subset(d, CatChro=="A3D1"); A4D1 <- subset(d, CatChro=="A4D1")
A3D2 <- subset(d, CatChro=="A3D2"); A4D2 <- subset(d, CatChro=="A4D2")
A3D3 <- subset(d, CatChro=="A3D3"); A4D3 <- subset(d, CatChro=="A4D3")
A3D4 <- subset(d, CatChro=="A3D4"); A4D4 <- subset(d, CatChro=="A4D4")
A3D5 <- subset(d, CatChro=="A3D5"); A4D5 <- subset(d, CatChro=="A4D5")

# Fit individual lines
rA3G1 <- lm(Qend~Rainfall, data=A3G1); summary(rA3G1)
rA3D1 <- lm(Qend~Rainfall, data=A3D1); summary(rA3D1)
rA3G2 <- lm(Qend~Rainfall, data=A3G2); summary(rA3G2)
rA3D2 <- lm(Qend~Rainfall, data=A3D2); summary(rA3D2)
rA3G3 <- lm(Qend~Rainfall, data=A3G3); summary(rA3G3)
rA3D3 <- lm(Qend~Rainfall, data=A3D3); summary(rA3D3)
rA3G4 <- lm(Qend~Rainfall, data=A3G4); summary(rA3G4)
rA3D4 <- lm(Qend~Rainfall, data=A3D4); summary(rA3D4)
rA3G5 <- lm(Qend~Rainfall, data=A3G5); summary(rA3G5)
rA3D5 <- lm(Qend~Rainfall, data=A3D5); summary(rA3D5)

rA4G1   <- lm(Qend~Rainfall, data=A4G1); summary(rA4G1)
rA4D1   <- lm(Qend~Rainfall, data=A4D1); summary(rA4D1)
rA4G2   <- lm(Qend~Rainfall, data=A4G2); summary(rA4G2)
rA4D2   <- lm(Qend~Rainfall, data=A4D2); summary(rA4D2)
rA4G3   <- lm(Qend~Rainfall, data=A4G3); summary(rA4G3)
rA4D3   <- lm(Qend~Rainfall, data=A4D3); summary(rA4D3)
rA4G4   <- lm(Qend~Rainfall, data=A4G4); summary(rA4G4)
rA4D4   <- lm(Qend~Rainfall, data=A4D4); summary(rA4D4)
rA4G5   <- lm(Qend~Rainfall, data=A4G5); summary(rA4G5)
rA4D5   <- lm(Qend~Rainfall, data=A4D5); summary(rA4D5)

# Gradient
summary(rA3G1)$coefficients[2,1]
summary(rA3D1)$coefficients[2,1]
summary(rA3G2)$coefficients[2,1]
summary(rA3D2)$coefficients[2,1] 
summary(rA3G3)$coefficients[2,1] 
summary(rA3D3)$coefficients[2,1] 
summary(rA3G4)$coefficients[2,1] 
summary(rA3D4)$coefficients[2,1] 
summary(rA3G5)$coefficients[2,1] 
summary(rA3D5)$coefficients[2,1]

# Intercept
summary(rA3G1)$coefficients[2,2] 
summary(rA3D1)$coefficients[2,2] 
summary(rA3G2)$coefficients[2,2] 
summary(rA3D2)$coefficients[2,2] 
summary(rA3G3)$coefficients[2,2] 
summary(rA3D3)$coefficients[2,2] 
summary(rA3G4)$coefficients[2,2] 
summary(rA3D4)$coefficients[2,2] 
summary(rA3G5)$coefficients[2,2] 
summary(rA3D5)$coefficients[2,2] 

# r-sq
summary(rA3G1)$r.squared
summary(rA3D1)$r.squared
summary(rA3G2)$r.squared
summary(rA3D2)$r.squared
summary(rA3G3)$r.squared
summary(rA3D3)$r.squared
summary(rA3G4)$r.squared
summary(rA3D4)$r.squared
summary(rA3G5)$r.squared
summary(rA3D5)$r.squared

# adj r-sq
summary(rA3G1)$adj.r.squared
summary(rA3D1)$adj.r.squared
summary(rA3G2)$adj.r.squared
summary(rA3D2)$adj.r.squared
summary(rA3G3)$adj.r.squared
summary(rA3D3)$adj.r.squared
summary(rA3G4)$adj.r.squared
summary(rA3D4)$adj.r.squared
summary(rA3G5)$adj.r.squared
summary(rA3D5)$adj.r.squared

# p-level
p <- summary(rA3G1)$fstatistic
pf(p[1], p[2], p[3], lower.tail=FALSE) 
p2 <- summary(rA3D1)$fstatistic
pf(p2[1], p2[2], p2[3], lower.tail=FALSE) 
p3 <- summary(rA3G2)$fstatistic
pf(p3[1], p3[2], p3[3], lower.tail=FALSE) 
p4 <- summary(rA3D2)$fstatistic
pf(p4[1], p4[2], p4[3], lower.tail=FALSE) 
p5 <- summary(rA3G3)$fstatistic
pf(p5[1], p5[2], p5[3], lower.tail=FALSE) 
p6 <- summary(rA3D3)$fstatistic
pf(p6[1], p6[2], p6[3], lower.tail=FALSE) 
p7 <- summary(rA3G4)$fstatistic
pf(p7[1], p7[2], p7[3], lower.tail=FALSE) 
p8 <- summary(rA3D4)$fstatistic
pf(p8[1], p8[2], p8[3], lower.tail=FALSE) 
p9 <- summary(rA3G5)$fstatistic
pf(p9[1], p9[2], p9[3], lower.tail=FALSE) 
p10 <- summary(rA3D5)$fstatistic
pf(p10[1], p10[2], p10[3], lower.tail=FALSE) 

This is the structure of my expected outcome:

Expected outcome

Is there any way to achieve this?

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
Candice
  • 67
  • 7
  • 1
    package broom, functions `tidy` and `glance` – phiver Dec 26 '18 at 14:16
  • 2
    In addition to `broom`, you can probably simplify this into one line by using a `group_by` on your dataframe. However, without having a minimal reproducable example, it is difficult to help. – coffeinjunky Dec 26 '18 at 14:20
  • I tried package broom, functions tidy gives the coefficients; while glance gives the r-squared and p-value. The information is scattered. And that is just for one lm. Now, that doesn't matter because i have extracted the information i want manually. But now, how to i tabulate this? – Candice Dec 26 '18 at 14:22
  • To give you some inspiration on putting all this into one or two lines, see e.g. https://stackoverflow.com/questions/37395059/running-several-linear-regressions-from-a-single-dataframe-in-r/37401209#37401209 – coffeinjunky Dec 26 '18 at 14:25
  • 2
    As to "how to tabulate this", what do you mean with "tabulate"? Can you show us your desired outcome? Maybe even provide some sample data? – coffeinjunky Dec 26 '18 at 14:29
  • Thank you people for your willingness to help. Sorry if the question was not well asked. I have edited the question and included the data file and a view of the expected outcome. Thank you once again. – Candice Dec 26 '18 at 14:41
  • 1
    Note - I think there is an error in your original code, which many (not all) of the answers below have replicated. The intercept is not at `coefficients[2,2]`, but at `coefficients[1,1]` – dww Dec 26 '18 at 18:11
  • @dww - thanks for catching the typo in the OP. I corrected my answer. – Len Greski Dec 26 '18 at 19:07

5 Answers5

1

Using library(data.table) you can do

d <- fread("example.csv")
d[, .(
  r2         = (fit <- summary(lm(Qend~Rainfall)))$r.squared,
  adj.r2     = fit$adj.r.squared,
  intercept  = fit$coefficients[1,1], 
  gradient   = fit$coefficients[2,1],
  p.value    = {p <- fit$fstatistic; pf(p[1], p[2], p[3], lower.tail=FALSE)}),
  by  = CatChro]

#    CatChro         r2       adj.r2   intercept     gradient      p.value
# 1:    A3G1 0.03627553  0.011564648 0.024432020 0.0001147645 0.2329519751
# 2:    A3D1 0.28069553  0.254054622 0.011876543 0.0004085644 0.0031181110
# 3:    A3G2 0.06449971  0.041112205 0.026079409 0.0004583538 0.1045970987
# 4:    A3D2 0.03384173  0.005425311 0.023601325 0.0005431693 0.2828170556
# 5:    A3G3 0.07587433  0.054383038 0.043537869 0.0006964512 0.0670399684
# 6:    A3D3 0.04285322  0.002972105 0.022106960 0.0001451185 0.3102578215
# 7:    A3G4 0.17337420  0.155404076 0.023706652 0.0006442175 0.0032431299
# 8:    A3D4 0.37219027  0.349768492 0.009301843 0.0006614213 0.0003442445
# 9:    A3G5 0.17227383  0.150491566 0.025994831 0.0006658466 0.0077413595
#10:    A3D5 0.04411669 -0.008987936 0.014341399 0.0001084626 0.3741011769
dww
  • 30,425
  • 5
  • 68
  • 111
1

Consider building a matrix of lm results. First create a defined function to handle your generalized data frame model build with results extraction. Then, call by which can subset your data frame by a factor column and pass each subset into defined method. Finally, rbind all grouped matrices together for a singular output

lm_results <- function(df) {

  model <- lm(Qend ~ Rainfall, data = df)
  res <- summary(model)

  p <- res$fstatistic

  c(gradient = res$coefficients[2,1],
    intercept = res$coefficients[2,2],
    r_sq = res$r.squared,
    adj_r_sq = res$adj.r.squared,
    f_stat = p[['value']],
    p_value = unname(pf(p[1], p[2], p[3], lower.tail=FALSE))
  )
}

matrix_list <- by(d, d$group, lm_results)

final_matrix <- do.call(rbind, matrix_list)

To demonstrate on random, seeded data

set.seed(12262018)
data_tools <- c("sas", "stata", "spss", "python", "r", "julia")

d <- data.frame(
  group = sample(data_tools, 500, replace=TRUE),
  int = sample(1:15, 500, replace=TRUE),
  Qend = rnorm(500) / 100,
  Rainfall = rnorm(500) * 10
)

Results

mat_list <- by(d, d$group, lm_results)

final_matrix <- do.call(rbind, mat_list)
final_matrix

#             gradient    intercept        r_sq     adj_r_sq    f_stat    p_value
# julia  -1.407313e-04 1.203832e-04 0.017219149  0.004619395 1.3666258 0.24595273
# python -1.438116e-04 1.125170e-04 0.018641512  0.007230367 1.6336233 0.20464162
# r       2.031717e-04 1.168037e-04 0.041432175  0.027738349 3.0256098 0.08635510
# sas    -1.549510e-04 9.067337e-05 0.032476668  0.021355710 2.9203121 0.09103619
# spss    9.326656e-05 1.068516e-04 0.008583473 -0.002682623 0.7618853 0.38511469
# stata  -7.079514e-05 1.024010e-04 0.006013841 -0.006568262 0.4779679 0.49137093
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • Hey... Thank you very much for the answer. I can follow this. However, there was a problem once it reached the last two lines. I got the following error message: > matrix_list <- by(d, d$group, lm_results) "Error in names(IND) <- deparse(substitute(INDICES))[1L] : 'names' attribute [1] must be the same length as the vector [0]" > final_matrix <- do.call(rbind, matrix_list) "Error in do.call(rbind, matrix_list) : object 'matrix_list' not found" The random, seeded data however worked well. – Candice Dec 27 '18 at 01:58
  • I hope you used your data's *group* in `by` call: `by(d, d$CatChro, lm_results)`. Sorry I could not test with your actual data as link was forbidden on work machine. We always ask especially for future readers to include sample of data inside the body of post. R's `dput(head(mydata))` is great for this. – Parfait Dec 27 '18 at 04:03
1

Here is a base R solution:

data <- read.csv("./data/so53933238.csv",header=TRUE)

# split by value of CatChro into a list of datasets
dataList <- split(data,data$CatChro)

# process the list with lm(), extract results to a data frame, write to a list
lmResults <- lapply(dataList,function(x){
     y <- summary(lm(Qend ~ Rainfall,data = x))
     Intercept <- y$coefficients[1,1]
     Slope <- y$coefficients[2,1]
     rSquared <- y$r.squared
     adjRSquared <- y$adj.r.squared
     f <- y$fstatistic[1]
     pValue <- pf(y$fstatistic[1],y$fstatistic[2],y$fstatistic[3],lower.tail = FALSE)
     data.frame(Slope,Intercept,rSquared,adjRSquared,pValue)
})
lmResultTable <- do.call(rbind,lmResults)
# add CatChro indicators
lmResultTable$catChro <- names(dataList)

lmResultTable 

...and the output:

    > lmResultTable
            Slope   Intercept   rSquared  adjRSquared       pValue catChro
A3D1 0.0004085644 0.011876543 0.28069553  0.254054622 0.0031181110    A3D1
A3D2 0.0005431693 0.023601325 0.03384173  0.005425311 0.2828170556    A3D2
A3D3 0.0001451185 0.022106960 0.04285322  0.002972105 0.3102578215    A3D3
A3D4 0.0006614213 0.009301843 0.37219027  0.349768492 0.0003442445    A3D4
A3D5 0.0001084626 0.014341399 0.04411669 -0.008987936 0.3741011769    A3D5
A3G1 0.0001147645 0.024432020 0.03627553  0.011564648 0.2329519751    A3G1
A3G2 0.0004583538 0.026079409 0.06449971  0.041112205 0.1045970987    A3G2
A3G3 0.0006964512 0.043537869 0.07587433  0.054383038 0.0670399684    A3G3
A3G4 0.0006442175 0.023706652 0.17337420  0.155404076 0.0032431299    A3G4
A3G5 0.0006658466 0.025994831 0.17227383  0.150491566 0.0077413595    A3G5
>

To render the output in a tabular format in HTML, one can use knitr::kable().

library(knitr)
kable(lmResultTable[1:5],row.names=TRUE,digits=5) 

...which produces the following output after rendering the Markdown:

enter image description here

Len Greski
  • 10,505
  • 2
  • 22
  • 33
  • 1
    Thank you very much! This is quite compact and worked well. And the additional table format is a huge bonus. – Candice Dec 27 '18 at 01:53
1

Here in only a couple of lines:

library(tidyverse)
library(broom)
# create grouped dataframe:
df_g <- df %>% group_by(CatChro)
df_g %>% do(tidy(lm(Qend ~ Rainfall, data = .))) %>% 
   select(CatChro, term, estimate) %>% spread(term, estimate) %>% 
   left_join(df_g %>% do(glance(lm(Qend ~ Rainfall, data = .))) %>%
   select(CatChro, r.squared, adj.r.squared, p.value), by = "CatChro")

And the result will be:

# A tibble: 10 x 6
# Groups:   CatChro [?]
   CatChro `(Intercept)` Rainfall r.squared adj.r.squared  p.value
   <chr>           <dbl>    <dbl>     <dbl>         <dbl>    <dbl>
 1 A3D1          0.0119  0.000409    0.281        0.254   0.00312 
 2 A3D2          0.0236  0.000543    0.0338       0.00543 0.283   
 3 A3D3          0.0221  0.000145    0.0429       0.00297 0.310   
 4 A3D4          0.00930 0.000661    0.372        0.350   0.000344
 5 A3D5          0.0143  0.000108    0.0441      -0.00899 0.374   
 6 A3G1          0.0244  0.000115    0.0363       0.0116  0.233   
 7 A3G2          0.0261  0.000458    0.0645       0.0411  0.105   
 8 A3G3          0.0435  0.000696    0.0759       0.0544  0.0670  
 9 A3G4          0.0237  0.000644    0.173        0.155   0.00324 
10 A3G5          0.0260  0.000666    0.172        0.150   0.00774 

So, how does this work?

The following creates a dataframe with all coefficients and the corresponding statistics (tidy turns the result of lm into a dataframe):

df_g %>%
  do(tidy(lm(Qend ~ Rainfall, data = .)))
A tibble: 20 x 6
Groups:   CatChro [10]
   CatChro term        estimate std.error statistic      p.value
   <chr>   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
 1 A3D1    (Intercept) 0.0119   0.00358       3.32  0.00258     
 2 A3D1    Rainfall    0.000409 0.000126      3.25  0.00312     
 3 A3D2    (Intercept) 0.0236   0.00928       2.54  0.0157      
 4 A3D2    Rainfall    0.000543 0.000498      1.09  0.283       

I understand that you want to have the intercept and the coefficient on Rainfall as individual columns, so let's "spread" them out. This is achieved by first selecting the relevant columns, and then invoking tidyr::spread, as in

select(CatChro, term, estimate) %>% spread(term, estimate)

This gives you:

df_g %>% do(tidy(lm(Qend ~ Rainfall, data = .))) %>% 
  select(CatChro, term, estimate) %>% spread(term, estimate)
A tibble: 10 x 3
Groups:   CatChro [10]
   CatChro `(Intercept)` Rainfall
   <chr>           <dbl>    <dbl>
 1 A3D1          0.0119  0.000409
 2 A3D2          0.0236  0.000543
 3 A3D3          0.0221  0.000145
 4 A3D4          0.00930 0.000661

Glance gives you the summary statistics you are looking for, for each model one. The models are indexed by group, here CatChro, so it is easy to just merge them onto the previous dataframe, which is what the rest of the code is about.

coffeinjunky
  • 11,254
  • 39
  • 57
  • Hi @coffeinjunky. Thank you very much. I can follow the code, but i had some problems running it. One of the problem is that apparently, tidyverse cannot be used even after i installed it. i had to install package(tidyr) in addition. But even after that, i got the following error message: "Error in UseMethod("group_by_") : " – Candice Dec 27 '18 at 02:15
  • and also this error message: "Error in eval(lhs, parent, parent) : object 'df_g' not found " – Candice Dec 27 '18 at 02:16
  • Sorry to hear that you have problems, but not sure I know enough to solve them as for me everything is working just fine. In general, if you want to continue working with R, I would recommend you look into the `split-apply-combine` paradigm and familiarize yourself with `tidy data` and `dplyr`. – coffeinjunky Dec 27 '18 at 07:56
1

Another solution, with lme4::lmList. The summary() method for objects produced by lmList does almost everything you want (although it doesn't store p-values, that's something I had to add below).

m <- lme4::lmList(Qend~Rainfall|CatChro,data=d)
s <- summary(m)
pvals <- apply(s$fstatistic,1,function(x) pf(x[1],x[2],x[3],lower.tail=FALSE))
data.frame(intercept=coef(s)[,"Estimate","(Intercept)"],
           slope=coef(s)[,"Estimate","Rainfall"],
           r.squared=s$r.squared,
           adj.r.squared=unlist(s$adj.r.squared),
           p.value=pvals)
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453