
One of the design patterns I use over and over is performing a "group by" or "split, apply, combine (SAC)" on a data frame and then joining the aggregated data back to the original data. This is useful, for example, when calculating each county's deviation from the state mean in a data frame with many states and counties. Rarely is my aggregate calculation only a simple mean, but it makes a good example. I often solve this problem the following way:

require(plyr)
set.seed(1)

## set up some data
group1 <- rep(1:3, 4)
group2 <- sample(c("A","B","C"), 12, rep=TRUE) 
values <- rnorm(12)
df <- data.frame(group1, group2, values)

## got some data, so let's aggregate

group1Mean <- ddply( df, "group1", function(x) 
                     data.frame( meanValue = mean(x$values) ) )
df <- merge( df, group1Mean )
df

Which produces nice aggregate data like the following:

> df
   group1 group2   values meanValue
1       1      A  0.48743 -0.121033
2       1      A -0.04493 -0.121033
3       1      C -0.62124 -0.121033
4       1      C -0.30539 -0.121033
5       2      A  1.51178  0.004804
6       2      B  0.73832  0.004804
7       2      A -0.01619  0.004804
8       2      B -2.21470  0.004804
9       3      B  1.12493  0.758598
10      3      C  0.38984  0.758598
11      3      B  0.57578  0.758598
12      3      A  0.94384  0.758598

This works, but are there alternative approaches that improve on readability, performance, etc.?

Henrik
JD Long

5 Answers


One line of code does the trick:

new <- ddply(df, "group1", transform, meanValue = mean(values))
new

group1 group2      values    meanValue
1       1      A  0.48742905 -0.121033381
2       1      A -0.04493361 -0.121033381
3       1      C -0.62124058 -0.121033381
4       1      C -0.30538839 -0.121033381
5       2      A  1.51178117  0.004803931
6       2      B  0.73832471  0.004803931
7       2      A -0.01619026  0.004803931
8       2      B -2.21469989  0.004803931
9       3      B  1.12493092  0.758597929
10      3      C  0.38984324  0.758597929
11      3      B  0.57578135  0.758597929
12      3      A  0.94383621  0.758597929

identical(df, new)
[1] TRUE
Andrie

I think ave() is more useful here than the plyr call you show, or than the other base R alternatives (aggregate(), tapply()). (I'm not familiar enough with plyr to know whether it can do this directly, but I would be surprised if it couldn't.)

> with(df, ave(values, group1, FUN = mean))
 [1] -0.121033381  0.004803931  0.758597929 -0.121033381  0.004803931
 [6]  0.758597929 -0.121033381  0.004803931  0.758597929 -0.121033381
[11]  0.004803931  0.758597929

You can use within() or transform() to embed this result directly into df:

> df2 <- within(df, meanValue <- ave(values, group1, FUN = mean))
> head(df2)
  group1 group2     values    meanValue
1      1      A  0.4874291 -0.121033381
2      2      B  0.7383247  0.004803931
3      3      B  0.5757814  0.758597929
4      1      C -0.3053884 -0.121033381
5      2      A  1.5117812  0.004803931
6      3      C  0.3898432  0.758597929
> df3 <- transform(df, meanValue = ave(values, group1, FUN = mean))
> all.equal(df2,df3)
[1] TRUE
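As a small extension of my own: ave() also accepts several grouping variables, so a per-(group1, group2) cell mean is just as easy. This sketch recreates the question's df; the column name cellMean is my own choice:

```r
## recreate the question's data
set.seed(1)
df <- data.frame(group1 = rep(1:3, 4),
                 group2 = sample(c("A", "B", "C"), 12, replace = TRUE),
                 values = rnorm(12))

## mean within each (group1, group2) combination
df4 <- transform(df, cellMean = ave(values, group1, group2, FUN = mean))
head(df4)
```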

And if the ordering is important:

> head(df2[order(df2$group1, df2$group2), ])
   group1 group2      values    meanValue
1       1      A  0.48742905 -0.121033381
10      1      A -0.04493361 -0.121033381
4       1      C -0.30538839 -0.121033381
7       1      C -0.62124058 -0.121033381
5       2      A  1.51178117  0.004803931
11      2      A -0.01619026  0.004803931
Gavin Simpson
  • I didn't realize that other functions could be used with `ave()`... I clearly should have read the docs before. – JD Long Feb 17 '11 at 16:33

In terms of performance, you can do the same kind of operation with the data.table package, which has built-in aggregation and is very fast thanks to its indexing and C-based implementation. For instance, given that df already exists from your example:

library(data.table)
dt <- as.data.table(df)
setkey(dt, group1)
dt <- dt[, list(group2, values, meanValue = mean(values)), by = group1]
dt
      group1 group2      values   meanValue
 [1,]      1      A  0.82122120  0.18810771
 [2,]      1      C  0.78213630  0.18810771
 [3,]      1      C  0.61982575  0.18810771
 [4,]      1      A -1.47075238  0.18810771
 [5,]      2      B  0.59390132  0.03354688
 [6,]      2      A  0.07456498  0.03354688
 [7,]      2      B -0.05612874  0.03354688
 [8,]      2      A -0.47815006  0.03354688
 [9,]      3      B  0.91897737 -0.20205707
[10,]      3      C -1.98935170 -0.20205707
[11,]      3      B -0.15579551 -0.20205707
[12,]      3      A  0.41794156 -0.20205707

I have not benchmarked it, but in my experience it is a lot faster.
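If you want to check that claim on your own machine, here is a rough timing sketch rather than a proper benchmark; the data size and object names are my own choices, and plyr and data.table are assumed to be installed:

```r
library(plyr)
library(data.table)

set.seed(1)
n <- 1e5
big <- data.frame(group1 = sample(1:1000, n, replace = TRUE),
                  values = rnorm(n))

## plyr: split into pieces, transform each, recombine
system.time(
  plyrRes <- ddply(big, "group1", transform, meanValue = mean(values))
)

## data.table: keyed, C-level grouping
bigDT <- as.data.table(big)
setkey(bigDT, group1)
system.time(
  dtRes <- bigDT[, list(values, meanValue = mean(values)), by = group1]
)
```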

If you decide to go down the data.table road, which I think is worth exploring if you work with large data sets, you really need to read the docs: there are some differences from data.frame that can bite you if you are unaware of them. Notably, though, a data.table generally works with any function expecting a data frame, since data.table inherits from data.frame and reports data.frame among its classes.
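One concrete example of such a difference, as a small sketch of my own: single-bracket indexing with one unnamed argument picks a column from a data frame but a row from a data.table:

```r
library(data.table)

df2 <- data.frame(a = 1:3, b = letters[1:3])
dt2 <- as.data.table(df2)

df2[2]  # a one-column data frame (column b)
dt2[2]  # a one-row data.table (row 2)
```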

[Feb 2011]

[Aug 2012] Update from Matthew:

New in v1.8.2, released to CRAN in July 2012, is := by group. This is very similar to the answer above, but it adds the new column to dt by reference, so there is no copy, no merge step, and no need to relist the existing columns to return them alongside the aggregate. There is also no need to setkey first, and it copes with non-contiguous groups (i.e. groups whose rows aren't stored together).

This is significantly faster for large datasets, and the syntax is simple and short:

dt <- as.data.table(df)
dt[, meanValue := mean(values), by = group1]
Matt Dowle
frankc

Can't you just add x to the function you pass to ddply?

df <- ddply( df, "group1", function(x)
             data.frame( x, meanValue = mean(x$values) ) )
JD Long
Joshua Ulrich

A dplyr possibility:

library(dplyr)
df %>% 
  group_by(group1) %>%
  mutate(meanValue = mean(values))

This returns the data frame in the original order. Add arrange(group1) to the pipe if you wish to order by "group1".
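For completeness, a self-contained sketch of the ordered variant; it recreates the question's df, and the trailing ungroup() is my own addition so the result comes back ungrouped:

```r
library(dplyr)

## recreate the question's data
set.seed(1)
df <- data.frame(group1 = rep(1:3, 4),
                 group2 = sample(c("A", "B", "C"), 12, replace = TRUE),
                 values = rnorm(12))

res <- df %>%
  group_by(group1) %>%
  mutate(meanValue = mean(values)) %>%  # per-group mean, length preserved
  arrange(group1) %>%                   # order rows by group1
  ungroup()
res
```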

Henrik