1

Anybody have any bright ideas on how to do multiple aggregations such as sum and mean with arithmetic on the left hand side of the formula, something like this:

aggregate(A+B ~ C, data=D, FUN=c(sum, mean))

I expect a 3 column result with C, mean(A+B) and sum(A+B).

I've looked at summaryBy from the 'doBy' package but it fails with the arithmetic.

The closest I've found is to create a custom function taking a param and then applying the 2 aggregation functions within it, however, the result is still a bit messy to work with as there are 2 columns, the second containing a list with both aggregations.

aggregate(A+B ~ C, data = D, FUN=function(x) c(s=sum(x), m=mean(x)))

It's tedious, verbose and more computationally expensive doing two aggregations across the same data and then merging those aggregations.

Tjad Clark
  • 552
  • 3
  • 17

3 Answers3

1

You could do it with dplyr like so:

require(dplyr)

D %>% 
  group_by(C) %>% 
  summarise(sum = sum(A+B), 
            mean = mean(A+B))

With sample data:

set.seed(99)
D <- data.frame(A = runif(10), B = runif(10), C= sample(LETTERS[1:3], 10, TRUE))
D %>% group_by(C) %>% summarise(sum = sum(A+B), mean = mean(A+B))
#Source: local data frame [3 x 3]
#
#  C      sum      mean
#1 A 3.596430 0.7192861
#2 B 3.236218 1.6181089
#3 C 2.498058 0.8326859
talat
  • 68,970
  • 21
  • 126
  • 157
  • This is very interesting, I will have to investigate the dplyr package in more depth. Thanks for taking the time to give your response :) – Tjad Clark Jun 20 '14 at 18:01
1

This seems not so bad to me. The formation of the aggregate values as a matrix column of the result is actually convenient since one can refer to the jth aggregate as DF[[2]][j] rather than resorting to index arithmetic.

At any rate here is a function that will do it the way you want:

myag <- function(formula, data, FUN) {
   .FUN <- function(x) sapply(FUN, function(f) match.fun(f)(x))
   do.call("data.frame", c(aggregate(formula, data, .FUN), check.names = FALSE) )
}

# test it
ag <-    myag(mpg + wt ~ cyl, mtcars, c("mean", "sum"))

giving:

> ag
  cyl mpg + wt.mean mpg + wt.sum
1   4      28.94936      318.443
2   6      22.86000      160.020
3   8      19.09921      267.389
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thank you for your response, I was trying to avoid creating new functions as I was looking for something I could rely on everywhere - that I wouldn't have to recreate or maintain :) – Tjad Clark Jun 20 '14 at 18:00
1

Like this?

set.seed(1)
D <- data.frame(A=rpois(100,1),B=rpois(100,1),C=rep(1:10,each=10))
result <- aggregate(A+B ~ C, data = D, FUN=function(x) c(s=sum(x), m=mean(x)))
result <- data.frame(result[,1],result[,2])

There is indeed an extra step to get the 3-column data frame, but this does not require multiple aggregation or merge.

Incidentally, the problem you point out, regarding the way aggregate(...) deals with functions that return vectors, applies not just for formulas with an expression on the LHS.

result <- aggregate(A ~ C, data = D, FUN=function(x) c(s=sum(x), m=mean(x)))

returns a 2 column data frame, where each element in the second column contains a vector of length 2, but if you display result you are led to believe that result has three columns

head(result)
#   C  A.s  A.m
# 1 1 11.0  1.1
# 2 2 13.0  1.3
# 3 3  8.0  0.8
# 4 4 10.0  1.0
# 5 5 12.0  1.2
# 6 6  7.0  0.7
str(result)
# 'data.frame': 10 obs. of  2 variables:
#  $ C: int  1 2 3 4 5 6 7 8 9 10
#  $ A: num [1:10, 1:2] 11 13 8 10 12 7 9 13 6 12 ...
#   ..- attr(*, "dimnames")=List of 2
#   .. ..$ : NULL
#   .. ..$ : chr  "s" "m"
jlhoward
  • 58,004
  • 7
  • 97
  • 140
  • If only I thought like you - simply. Your solution indeed does what I am looking for. I found a [solution](http://stackoverflow.com/questions/12064202/using-aggregate-for-multiple-aggregations) which I didn't investigate further - sorry about that- this lead to the slightly more generic version of your above solution. `as.data.frame(as.list(aggregate(A+B ~ C, data = D, FUN=function(x) c(s=sum(x), m=mean(x)))))` Thanks for your response, think I'll mark this as the solution as it best answers how I thought about this. But the dplyr answer below is very intriguing as well. – Tjad Clark Jun 20 '14 at 17:54
  • I think it's also useful to note - as seen in solution link from previous comment- that using the cbind rules on the left hand side allows one to rename the colums as follows. `cbind(my_name_instead_of_a_plus_b=A+B)` – Tjad Clark Jun 20 '14 at 18:07