To get grouped results in R

Question

Here I have like below data set. In SAS system, to get grouped results in whatever I do such as means, GLM or REG, I can cord as:

proc sort; by B;
proc glm;
class A;
model C=A; by B;
run;

Then, I can get GLM results within or by B level. But, I DO NOT know how to use like'by' to grouping in R system. You might want to suggest me to use >subset(), however, this cord would be really difficult when you have, for example, 10 levels of B. You might want to learn me but only anova analysis but also regression and average. Here is anyone to help me?

raw data

A   B   C
a   a   0.47
a   b   0.88
a   c   2.32
a   d   3.26
a   a   0.93
a   b   1.86
a   c   3.22
a   d   0.92
a   a   0.45
a   b   0.92
a   c   2.31
a   d   3.24
b   a   0.91
b   b   1.84
b   c   3.27
b   d   0.86
b   a   0.47
b   b   0.90
b   c   2.33
b   d   3.19
b   a   0.92
b   b   1.84
b   c   3.25
b   d   0.93
c   a   0.45
c   b   0.92
c   c   2.33
c   d   3.08
c   a   0.93
c   b   1.86
c   c   3.25
c   d   0.93
c   a   0.47
c   b   0.90
c   c   2.26
c   d   3.09

score 2 · Answer 1 · answered Oct 12 '12 at 17:37

Making your dataset easier to import to R:

dat <-
structure(list(A = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", 
"b", "c"), class = "factor"), B = structure(c(1L, 2L, 3L, 4L, 
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L
), .Label = c("a", "b", "c", "d"), class = "factor"), C = c(0.47, 
0.88, 2.32, 3.26, 0.93, 1.86, 3.22, 0.92, 0.45, 0.92, 2.31, 3.24, 
0.91, 1.84, 3.27, 0.86, 0.47, 0.9, 2.33, 3.19, 0.92, 1.84, 3.25, 
0.93, 0.45, 0.92, 2.33, 3.08, 0.93, 1.86, 3.25, 0.93, 0.47, 0.9, 
2.26, 3.09)), .Names = c("A", "B", "C"), class = "data.frame", row.names = c(NA, 
-36L))

Much of SAS by group processing maps to split-apply-combine methods (divide the data into parts, do something with each part, put those parts back together in some way). In this case, the results of models are objects (lists), and the natural way to "combine" multiple models is to put them in a list.

library("plyr")
models <- dlply(dat, .(B), function(DF) glm(C~A, data=DF))

models is now a list, each element of which is the result of a glm on a subset of dim.

> models
$a

Call:  glm(formula = C ~ A, data = DF)

Coefficients:
(Intercept)           Ab           Ac  
  6.167e-01    1.500e-01    6.799e-17  

Degrees of Freedom: 8 Total (i.e. Null);  6 Residual
Null Deviance:      0.472 
Residual Deviance: 0.427        AIC: 6.107 

$b

Call:  glm(formula = C ~ A, data = DF)

Coefficients:
(Intercept)           Ab           Ac  
   1.220000     0.306667     0.006667  

Degrees of Freedom: 8 Total (i.e. Null);  6 Residual
Null Deviance:      1.99 
Residual Deviance: 1.806        AIC: 19.09 

$c

Call:  glm(formula = C ~ A, data = DF)

Coefficients:
(Intercept)           Ab           Ac  
   2.616667     0.333333    -0.003333  

Degrees of Freedom: 8 Total (i.e. Null);  6 Residual
Null Deviance:      1.958 
Residual Deviance: 1.733        AIC: 18.72 

$d

Call:  glm(formula = C ~ A, data = DF)

Coefficients:
(Intercept)           Ab           Ac  
     2.4733      -0.8133      -0.1067  

Degrees of Freedom: 8 Total (i.e. Null);  6 Residual
Null Deviance:      11.4 
Residual Deviance: 10.23        AIC: 34.69 

attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
  B
1 a
2 b
3 c
4 d

Extraction of information from all models at once follows the same paradigm:

> ldply(models, coefficients)
  B (Intercept)         Ab            Ac
1 a   0.6166667  0.1500000  6.798700e-17
2 b   1.2200000  0.3066667  6.666667e-03
3 c   2.6166667  0.3333333 -3.333333e-03
4 d   2.4733333 -0.8133333 -1.066667e-01

Thank you so much Brian, but I am still wondering when you import data set, you might cord **read.table**, because you could control vast size of data. With you suggestion, really thanks, however, not to easy to control large size of data. I actually don't know if there is the cord I have asked. — Lim Hyungwoo, Oct 12 '12 at 19:18
@LimHyungwoo I think "cord" is not the word you mean. The way Brian set the data up for import is very nice for making a short, reproducible example because it can be copy/pasted into R. See http://stackoverflow.com/q/5963269/903061 — Gregor Thomas, Oct 12 '12 at 19:27
@LimHyungwoo I also am confused by "cord", but what I showed is not how I would generally read data into R. In fact, I used `read.table` on the text you gave to get that data.frame. As @shujaa said, I included it in order to make an easily reproducible example (that is, to quickly get to the point of answering the specific question, not to spend time on importing data). For ways to import data into R, see the "R Data Import/Export" (comes with R or is available on the website). — Brian Diggs, Oct 12 '12 at 19:59
@BrianDiggs and shujaa, Thanks you guys. Your main suggestion was that I can use the package, plyr and Idply. Cheers, — Lim Hyungwoo, Oct 13 '12 at 07:32

Jilber Urbina · Accepted Answer · 2012-10-13T09:09:11.567

Based on @Brian Diggs' answer but using R base functions. I also used Brian's data set

models <- lapply(split(dat, dat$B), function(x) glm(C~A, data=x)) 
do.call(rbind, lapply(models, function(y) y$coef))
  (Intercept)         Ab            Ac
a   0.6166667  0.1500000  1.802585e-17
b   1.2200000  0.3066667  6.666667e-03
c   2.6166667  0.3333333 -3.333333e-03
d   2.4733333 -0.8133333 -1.066667e-01

Edit 1: Results with P.values (alternative 1)

models <- lapply(split(dat, dat$B), function(x) glm(C~A, data=x)) #the same as above 
Coef <- lapply(models, function(y) y$coef) # the same as above
Pval <-  lapply(models, function(z) summary(z)$coefficients[, 'Pr(>|t|)'])
Result <- cbind(do.call(rbind, Coef), do.call(rbind, Pval))
colnames(Result)[4:6] <- paste('P.val', colnames(Result)[4:6])
Result

 (Intercept)         Ab            Ac P.val (Intercept)  P.val Ab  P.val Ac
a   0.6166667  0.1500000  1.802585e-17       0.007088207 0.5167724 1.0000000
b   1.2200000  0.3066667  6.666667e-03       0.008446002 0.5191737 0.9886090
c   2.6166667  0.3333333 -3.333333e-03       0.000151772 0.4762936 0.9941859
d   2.4733333 -0.8133333 -1.066667e-01       0.016803317 0.4744404 0.9235623

Edit 2: Results with P.values (alternative 2)

models <- lapply(split(dat, dat$B), function(x) glm(C~A, data=x)) #the same as above
do.call(rbind, lapply(models, function(z) summary(z)$coefficients))
                 Estimate Std. Error       t value    Pr(>|t|)
(Intercept)  6.166667e-01  0.1540202  4.003804e+00 0.007088207
Ab           1.500000e-01  0.2178175  6.886500e-01 0.516772358
Ac           1.802585e-17  0.2178175  8.275668e-17 1.000000000
(Intercept)  1.220000e+00  0.3167661  3.851423e+00 0.008446002
Ab           3.066667e-01  0.4479749  6.845622e-01 0.519173660
Ac           6.666667e-03  0.4479749  1.488179e-02 0.988608995
(Intercept)  2.616667e+00  0.3103164  8.432253e+00 0.000151772
Ab           3.333333e-01  0.4388537  7.595545e-01 0.476293582
Ac          -3.333333e-03  0.4388537 -7.595545e-03 0.994185937
(Intercept)  2.473333e+00  0.7538543  3.280917e+00 0.016803317
Ab          -8.133333e-01  1.0661110 -7.628974e-01 0.474440418
Ac          -1.066667e-01  1.0661110 -1.000521e-01 0.923562285

Thanks, This is far much clear. But, do you think I could find p-value there? — Lim Hyungwoo, Oct 13 '12 at 07:34

score 0 · Answer 3 · answered Oct 12 '12 at 16:50

0

Maybe the aggregate function is what you're looking for

aggregate(data$C, by=list(data$A), FUN=sum)

This will group your data by the first column, and collapsing column C into the sum for each group in the first column

answered Oct 12 '12 at 16:50

Omar Wagih

8,504
7
59
75

To get grouped results in R

3 Answers3

Edit 1: Results with P.values (alternative 1)

Edit 2: Results with P.values (alternative 2)