fast way to calculate mean with large dataset

Question

I have a very large dataset like the following:

myd <- data.frame (id = paste("id_",rep(1:500000, each = 3), sep=""),
      yvar= rep(1:500000, each= 3), xvar= rep(1:500000, each= 3))

I would like to calculate the mean for each id. I am trying the following, but it is taking a long time.

myd1 <- aggregate(myd, list(myd$id), mean)

Is there a quicker way to do this?

Did you read the answer below? It's a bajillion times faster than yours (which has a typo) on my computer. I guess yours should be `aggregate(myd[, -1], list(myd$id), mean)` — Frank, Apr 11 '17 at 14:49
Thank you for reading the question carefully and answering it. — jon, Apr 11 '17 at 14:52
For a comprehensive speed comparison, see Ari Friedman's answer to the "Average data by group" question linked above. — Frank, Apr 11 '17 at 15:01

Erdem Akkas · Accepted Answer · 2017-04-11T14:55:42.853

With data.table:

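library(data.table)
# copy into a data.table named mydt; myd itself stays a data.frame
mydt <- as.data.table(myd)
# mean of yvar and xvar within each id
mydt[, .(mean(yvar), mean(xvar)), by = id]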
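
The unnamed expressions in `.()` come back with the default column names `V1` and `V2`. A minimal sketch of one way to keep the original column names, using `lapply(.SD, mean)` with `.SDcols`. The smaller 1,000-id dataset and the names `small` and `res` are assumptions for illustration, not part of the original answer:

```r
library(data.table)

# Smaller stand-in for myd (assumption: 1,000 ids instead of 500,000)
small <- data.table(id   = paste0("id_", rep(1:1000, each = 3)),
                    yvar = rep(1:1000, each = 3),
                    xvar = rep(1:1000, each = 3))

# lapply(.SD, mean) applies mean() to each column listed in .SDcols,
# so the result keeps the names yvar and xvar instead of V1 and V2
res <- small[, lapply(.SD, mean), by = id, .SDcols = c("yvar", "xvar")]
head(res)
```

This form also scales to many value columns, since only `.SDcols` has to change.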

Performance comparison as follows:

system.time(myd1 <- aggregate(myd[, -1], list(myd$id), mean))
   user  system elapsed
  19.56    0.08   19.72

system.time(mydt1 <- mydt[, .(mean(yvar), mean(xvar)), by = id])
   user  system elapsed
   0.07    0.00    0.06
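
Before relying on the faster version, it may be worth checking that both approaches return the same means. A rough sketch on a reduced dataset; the 1,000-id size and the names `small`, `agg`, `dt`, and `dt_sorted` are assumptions for illustration. Note that `aggregate()` returns groups sorted by id, while data.table keeps them in order of first appearance, so the rows are aligned before comparing:

```r
library(data.table)

# Reduced version of the question's data (assumption: 1,000 ids)
small <- data.frame(id   = paste0("id_", rep(1:1000, each = 3)),
                    yvar = rep(1:1000, each = 3),
                    xvar = rep(1:1000, each = 3))

# Base R: group means of every column except id
agg <- aggregate(small[, -1], list(id = small$id), mean)

# data.table: the same group means, with explicit column names
dt <- as.data.table(small)[, .(yvar = mean(yvar), xvar = mean(xvar)), by = id]

# Reorder the data.table rows to match the row order of aggregate()
dt_sorted <- dt[match(agg$id, dt$id)]

# Compare the two sets of means column by column
same_y <- isTRUE(all.equal(agg$yvar, dt_sorted$yvar))
same_x <- isTRUE(all.equal(agg$xvar, dt_sorted$xvar))
print(c(yvar = same_y, xvar = same_x))
```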

edited Apr 11 '17 at 14:55

answered Apr 11 '17 at 14:33

Erdem Akkas
