
I have the following sample data.table:

dtb <- data.table(a=sample(1:100,100), b=sample(1:100,100), id=rep(1:10,10))

I would like to aggregate all columns (a and b, though they should be kept separate) by id using colSums, for example. What is the correct way to do this? The following does not work:

 dtb[,colSums, by="id"]

This is just a sample and my table has many columns so I want to avoid specifying all of them in the function name


2 Answers

This is actually what I was looking for, and it is mentioned in the FAQ:

dtb[,lapply(.SD,mean),by="id"]
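For the column sums asked about in the question, the same `.SD` idiom works with `sum` in place of `mean`. A minimal sketch (the `set.seed` call and the `res`/`res_a` names are just for illustration; `.SDcols` optionally restricts which columns are aggregated):

```r
library(data.table)

set.seed(1)  # reproducible sample data
dtb <- data.table(a = sample(1:100, 100), b = sample(1:100, 100), id = rep(1:10, 10))

# Sum every non-grouping column within each id -- the per-group analogue of colSums
res <- dtb[, lapply(.SD, sum), by = id]

# .SDcols restricts the aggregation to named columns, e.g. only "a"
res_a <- dtb[, lapply(.SD, sum), by = id, .SDcols = "a"]
```

This avoids naming each of the many columns individually, since `.SD` already contains all non-grouping columns.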
  • +1 Yes, you are completely right, this is definitely the better way. I always think that I should have everything in long format, but quite often, as in this case, doing the computations is more efficient. You should mark yours as the correct answer. – Christoph_J Jul 30 '12 at 13:29
  • +1 Btw, this syntax has been optimized in the latest v1.8.2. There used to be a speed penalty for using `lapply` in `j` like that, but no longer. Revisions to `data.table` wiki point 1 have been made, and to point 5 are needed. – Matt Dowle Aug 08 '12 at 15:50

I guess in this case it is fastest to bring your data into long format first and do your aggregation next (see Matthew's comments in this SO post):

library(data.table)
dtb <- data.table(a=sample(1:100,100), b=sample(1:100,100), id=rep(1:10,10))
library(reshape2)
dt_long <- as.data.table(melt(dtb, id.var="id"))
dt_long[, sum(value), by=c("id","variable")]
    id variable  V1
 1:  1        a 601
 2:  2        a 440
 3:  3        a 496
 4:  4        a 553
 5:  5        a 444
 6:  6        a 466
 7:  7        a 525
 8:  8        a 553
 9:  9        a 541
...
  • This seems pretty inefficient.. is there no way to just select ids once instead of once per variable? – Alex Jul 29 '12 at 04:10
  • I'm confused... What do you mean by inefficient? There is too much code to write, or it's too slow? And what do you mean by selecting ids once instead of once per variable? Didn't you want the sum for every variable and id combination? – Christoph_J Jul 29 '12 at 07:59
  • Yes, that's right. By inefficient I mean how many searches through the data frame the code has to do. Unless I misunderstand how R does things, with a vector operation the id has to be looked up once and then the sum across columns is done as a vector operation. In the way you propose, each (id, variable) pair has to be looked up every time. My table has about 200 columns, so that will make a difference, no? – Alex Jul 29 '12 at 21:27