
I would like to aggregate a data.frame by an identifier variable called ensg. The data frame looks like this:

  chromosome probeset               ensg symbol    XXA_00    XXA_36    XXB_00
1          X  4938842 ENSMUSG00000000003   Pbsn  4.796123  4.737717  5.326664

I want to compute the mean of each numeric column over rows with the same ensg value. The catch is that I would like to carry the other identity variables, chromosome and symbol, along untouched, since they are constant within each ensg group.

In the end I would like a data.frame with the identity columns chromosome, ensg, and symbol plus the means of the numeric columns over rows with the same identifier. I implemented this with ddply, but it is very slow compared to aggregate:

spec.mean <- function(eset.piece) {
  # Keep the identity columns from the first row of the piece and
  # append the means of the numeric columns
  cbind(eset.piece[1, -numeric.columns],
        t(colMeans(eset.piece[, numeric.columns])))
}

mean.eset <- ddply(eset.consensus.grand, .(ensg), spec.mean, .progress = "tk")

My first aggregate implementation looks like this,

mean.eset <- aggregate(eset[, numeric.columns], by = list(eset$ensg), FUN = mean, na.rm = TRUE)

and is much faster. But the problem with aggregate is that it drops the describing variables, so I have to reattach them afterwards. I also have not figured out how to use my custom function with aggregate, since aggregate passes FUN only vectors, not data frames.
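
For reference, the reattachment might look like this (a sketch; Group.1 is the default name aggregate gives the grouping column):

# Sketch: put chromosome and symbol back after aggregate().
# Assumes the identity columns are constant within each ensg.
ids <- unique(eset[, c("chromosome", "ensg", "symbol")])
mean.eset <- merge(ids, mean.eset, by.x = "ensg", by.y = "Group.1")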

Is there an elegant way to do this with aggregate? Or is there some faster way to do it with ddply?

– Johannes

2 Answers


If speed is a primary concern, you should take a look at the data.table package. When the number of rows or grouping columns is large, data.table really seems to shine. The package wiki has several links to good introductory documents.

Here's how you'd do this aggregation with data.table:

library(data.table)

# Turn the data.frame above into a data.table
dt <- data.table(df)

# Aggregation: grouped means, calling .Internal(mean()) to skip S3 dispatch
dt[, list(XXA_00 = .Internal(mean(XXA_00)),
          XXA_36 = .Internal(mean(XXA_36)),
          XXB_00 = .Internal(mean(XXB_00))),
   by = c("ensg", "chromosome", "symbol")]

Gives us

     ensg chromosome symbol      XXA_00      XXA_36    XXB_00
[1,]   E1          A     S1  0.18026869  0.13118997 0.6558433
[2,]   E2          B     S2 -0.48830539  0.24235537 0.5971377
[3,]   E3          C     S3 -0.04786984 -0.03139901 0.5618208
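
A more generic variant (a sketch, not what was benchmarked below) uses .SD so you don't have to name each column; it assumes a data.table version that supports .SDcols, and plain mean(), which newer versions optimize automatically (see the comments):

# Sketch: mean of every measurement column by group, without
# naming each one. .SDcols restricts .SD to the numeric columns.
num.cols <- c("XXA_00", "XXA_36", "XXB_00")
dt[, lapply(.SD, mean), by = c("ensg", "chromosome", "symbol"),
   .SDcols = num.cols]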

The aggregate solution provided above seems to fare pretty well on the 30-row data.frame, judging by output from the rbenchmark package. When the data.frame contains 3e5 rows, however, data.table pulls away as a clear winner.
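
The benchmarked wrappers fag() and fdt() aren't defined above; a minimal sketch of what they plausibly wrap (hypothetical reconstructions, using df and dt from this answer scaled up to 3e5 rows):

# Hypothetical reconstructions of the two benchmarked functions.
fag <- function() {
  # aggregate() over the numeric columns, grouped by the identity columns
  aggregate(df[, c("XXA_00", "XXA_36", "XXB_00")],
            by = list(df$ensg, df$chromosome, df$symbol),
            FUN = mean)
}

fdt <- function() {
  # data.table grouped means, as in the code above
  dt[, list(XXA_00 = .Internal(mean(XXA_00)),
            XXA_36 = .Internal(mean(XXA_36)),
            XXB_00 = .Internal(mean(XXB_00))),
     by = c("ensg", "chromosome", "symbol")]
}

Here's the output: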

> benchmark(fag(), fdt(), replications = 10)
   test replications elapsed relative user.self sys.self
1 fag()           10   12.71 23.98113     12.40     0.31
2 fdt()           10    0.53  1.00000      0.48     0.05
– Chase
  • Not a fan of encouraging users to call functions via `.Internal` interactively and I doubt it makes that much of a difference in this case. – Joshua Ulrich Dec 13 '11 at 15:22
  • 1
    @JoshuaUlrich - I've never been bitten by `.Internal` so can't really say one way or another. The wiki page for `data.table` shows a nearly 10x improvement in speed for `.Internal(mean())` over `mean()` in their test case. That's where I picked up that (possibly bad) habit. I'll update the test using standard `mean()` and see what actually happens though. – Chase Dec 13 '11 at 15:27
  • 2
    You could use `mean.default` instead of `mean` to avoid the method dispatch and that would save you a little time. I don't mind showing `.Internal` as a _option_ as long as it's followed with a "here be dragons" warning. – Joshua Ulrich Dec 13 '11 at 15:52
  • @JoshuaUlrich -- Using `mean.default` is a nice suggestion. I just tried it out with the example on the wiki page (http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table#don_t_use_mean_x_use_.internal_mean_x), and `mean.default()` buys a 4X speedup relative to `mean()`. (`.Internal(mean())` achieves an 8X speedup.) – Josh O'Brien Dec 13 '11 at 16:46
  • This one could do with an edit. No need to worry about `.Internal` anymore, as @JoshuaUlrich rightly pointed out: as of a few versions ago `mean` now gets optimized automatically. Should it be an edit tacked on the end, or just change the answer and benchmark again? Not sure. – Matt Dowle Nov 28 '12 at 17:17

First let's define a toy example:

df <- data.frame(chromosome = gl(3, 10, labels = c('A', 'B', 'C')),
                 probeset = gl(3, 10, labels = c('X', 'Y', 'Z')),
                 ensg = gl(3, 10, labels = c('E1', 'E2', 'E3')),
                 symbol = gl(3, 10, labels = c('S1', 'S2', 'S3')),
                 XXA_00 = rnorm(30),
                 XXA_36 = rnorm(30),
                 XXB_00 = rnorm(30))

And then we use aggregate with the formula interface:

df1 <- aggregate(cbind(XXA_00, XXA_36, XXB_00) ~ ensg + chromosome + symbol,
                 data = df, FUN = mean)

> df1
  ensg chromosome symbol      XXA_00      XXA_36      XXB_00
1   E1          A     S1 -0.02533499 -0.06150447 -0.01234508
2   E2          B     S2 -0.25165987  0.02494902 -0.01116426
3   E3          C     S3  0.09454154 -0.48468517 -0.25644569
– Oscar Perpiñán
  • Thanks for this beautiful solution! It is also lightning fast compared to ddply. I didn't know about the power of the formula interface! I have been using R for a couple of months now, but it keeps surprising me (in a good way). Embarrassingly, I still lack the reputation to vote your solution up... – Johannes Dec 13 '11 at 12:52
  • @Johannes: FWIW, the formula interface is slower than the others. – Joshua Ulrich Dec 13 '11 at 12:53
  • I have now adapted the solution posted [here](http://stackoverflow.com/a/3685919/1093951) by hadley (sketched below). It relies on converting the data to a matrix before applying the mean, leading to a huge speed increase. With my data set of dimension 10^5 x 43 the speed increases by a factor of 25! – Johannes Dec 13 '11 at 22:50
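
A minimal sketch of that matrix-based idea, using the toy df from this answer (rowsum() computes per-group column sums, and dividing by the group sizes yields the means; this is an illustration, not the exact code from the linked answer):

# Sketch: grouped means via a matrix and rowsum().
num <- as.matrix(df[, c("XXA_00", "XXA_36", "XXB_00")])
sums <- rowsum(num, group = df$ensg)    # per-group column sums
counts <- as.vector(table(df$ensg))     # number of rows per group
means <- sums / counts                  # per-group column means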