Summarize a dataframe by groups

Question

Consider the following dataframe with 4 columns:

df = data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10), D = rnorm(10))

The columns A, B, C, D belong to different groups, and the groups are defined in a separate dataframe:

groups = data.frame(Class = c("A","B","C","D"), Group = c("G1", "G2", "G2", "G1"))

#> groups
#  Class Group
#1     A    G1
#2     B    G2
#3     C    G2
#4     D    G1

I would like to average elements of the columns that belong to the same group, and get something similar to:

#> res
#            G1          G2
#1  -0.30023039 -0.71075139
#2   0.53053443 -0.12397126
#3   0.21968567 -0.46916160
#4  -1.13775100 -0.61266026
#5   1.30388130 -0.28021734
#6   0.29275876 -0.03994522
#7  -0.09649998  0.59396983
#8   0.71334020 -0.29818438
#9  -0.29830924 -0.47094084
#10 -0.36102888 -0.40181739

where each cell of G1 is the mean of the relative cells of A and D, and each cell of G2 is the mean of the relative cells of B and C, etc.

I was able to achieve this result, but in a rather brute force way:

l = levels(groups$Group)
res = data.frame(matrix(nc = length(levels), nr = nrow(df)))
for(i in 1:length(l)) {
    df.sub = df[which(groups$Group == l[i])]
    res[,i] = apply(df.sub, 1, mean)
}
names(res) <- l

Is there a better way of doing this? In reality, I have more than 20 columns and more than 10 groups.

Thank you!

possible duplicate of [R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate vs](http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega) — gung - Reinstate Monica, Oct 23 '13 at 16:15
Also try this thread: [quick-elegant-way-to-construct-mean-variance-summary-table](http://stackoverflow.com/questions/7449198/). — gung - Reinstate Monica, Oct 23 '13 at 16:17
@gung These threads don't seem to address my problem. The function `ddply` will be helpful should the groups be defined in the same dataframe as the data. This is not the case here. — Mayou, Oct 23 '13 at 16:19

Ricardo Saporta · Accepted Answer · 2013-10-23T16:47:20.200

3

using data.table

library(data.table)
groups <- data.table(groups, key="Group")
DT <- data.table(df)

groups[, rowMeans(DT[, Class, with=FALSE]), by=Group][, setnames(as.data.table(matrix(V1, ncol=length(unique(Group)))), unique(Group))]

             G1         G2
 1: -0.13052091 -0.3667552
 2:  1.17178729 -0.5496347
 3:  0.23115841  0.8317714
 4:  0.45209516 -1.2180895
 5: -0.01861638 -0.4174929
 6: -0.43156831  0.9008427
 7: -0.64026238  0.1854066
 8:  0.56225108 -0.3563087
 9: -2.00405840 -0.4680040
10:  0.57608055 -0.6177605



# Also, make sure you have characters, not factors, 
groups[, Class := as.character(Class)]
groups[, Group := as.character(Group)]

simple base:

 tapply(groups$Class, groups$Group, function(X) rowMeans(df[, X]))

using sapply :

 sapply(unique(groups$Group), function(X) 
     rowMeans(df[, groups[groups$Group==X, "Class"]]) )

edited Oct 23 '13 at 16:47

answered Oct 23 '13 at 16:19

Ricardo Saporta

54,400
17
144
178

@Mariam, `data.frame(out[1], out[2])` (where "out" is the result of Ricardo's very straightforward answer)? Ricardo, +1 – A5C1D2H2I1M1N2O1R2T1 Oct 23 '13 at 16:29
For some reason, I get the following error: `Error in rowMeans(df[, X]) : 'x' must be an array of at least two dimensions` when I use `tapply` on my real dataset. The dimension of my real dataset are 237 and 10 – Mayou Oct 23 '13 at 16:35
@Mariam, it looks like you are conflating upper and lower case letters. In either case, I would recommend the data.table solution – Ricardo Saporta Oct 23 '13 at 16:48
Perfect, that will do it! Thanks! – Mayou Oct 23 '13 at 17:14

score 0 · Answer 2 · answered Oct 23 '13 at 16:33

I would personally go with Ricardo's solution, but another option would be to merge your two datasets first, and then use your preferred method of aggregating.

library(reshape2)

## Retain the "rownames" so we can aggregate by row
temp <- merge(cbind(id = rownames(df), melt(df)), groups, 
              by.x = "variable", by.y = "Class")
head(temp)
#   variable id      value Group
# 1        A  1 -0.6264538    G1
# 2        A  2  0.1836433    G1
# 3        A  3 -0.8356286    G1
# 4        A  4  1.5952808    G1
# 5        A  5  0.3295078    G1
# 6        A  6 -0.8204684    G1

## This is the perfect form for `dcast` to do its work
dcast(temp, id ~ Group, value.var="value", mean)
#    id          G1          G2
# 1   1  0.36611287  1.21537927
# 2  10  0.22889368  0.50592144
# 3   2  0.04042780  0.58598977
# 4   3 -0.22397850 -0.27333780
# 5   4  0.77073788 -2.10202579
# 6   5 -0.52377589  0.87237833
# 7   6 -0.61773147 -0.05053117
# 8   7  0.04656955 -0.08599288
# 9   8  0.33950565 -0.26345809
# 10  9  0.83790336  0.17153557

(Above data using set.seed(1) on your sample "df".

Summarize a dataframe by groups

2 Answers2

using data.table