6

Is there a faster way to do this? I suspect it is unnecessarily slow and that a task like this can be accomplished with base functions.

df <- ddply(df, "id", function(x) cbind(x, perc.total = sum(x$cand.perc)))

I'm quite new to R. I have looked at `by()`, `aggregate()` and `tapply()`, but couldn't get them to work at all, or not in the way I wanted. Rather than returning a shorter vector, I want to attach the sum to the original data frame. What is the best way to do this?
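For reference, a minimal data frame of the shape assumed throughout (the values here are made up):

# Toy data: three observations per id, hypothetical values
df <- data.frame(id = rep(1:3, each = 3), cand.perc = 1:9)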

Edit: Here is a speed comparison of the answers applied to my data.

> # My original solution
> system.time( ddply(df, "id", function(x) cbind(x, perc.total = sum(x$cand.perc))) )
   user  system elapsed 
 14.405   0.000  14.479 

> # Paul Hiemstra
> system.time( ddply(df, "id", transform, perc.total = sum(cand.perc)) )
   user  system elapsed 
 15.973   0.000  15.992 

> # Richie Cotton
> system.time( with(df, tapply(df$cand.perc, df$id, sum))[df$id] )
   user  system elapsed 
  0.048   0.000   0.048 

> # John
> system.time( with(df, ave(cand.perc, id, FUN = sum)) )
   user  system elapsed 
  0.032   0.000   0.030 

> # Christoph_J
> system.time( df[ , list(perc.total = sum(cand.perc)), by="id"][df])
   user  system elapsed 
  0.028   0.000   0.028 
ilprincipe
  • I wonder what this would do (maybe just a shot in the dark): `aggregate(cand.perc ~ id, data = df, FUN = sum)`? Alternatively, you could do `tapply(df$cand.perc, INDEX = df$id, FUN = sum)`. For further use of the apply family of functions, see http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega – Roman Luštrik Nov 22 '11 at 11:21
  • @RomanLuštrik These will be of a lower dimension to the original data, but you could use `merge` to get what the OP wants with these – James Nov 22 '11 at 11:34
  • I somehow missed the "attach the sum to the original df". Thanks! – Roman Luštrik Nov 22 '11 at 12:10
  • OK, it's hard to tell exactly what you want. If you do any kind of aggregating, you're going to make the new vector shorter. So I'm guessing you're asking for one of two things. One might be the sums of each row, appending that. The other might be that you want the aggregate by id but with replicates maintained, so that it's still the proper length to be a new column in df. Is either of these correct? Please clarify the question so that someone can tell. – John Nov 22 '11 at 12:25
  • I'm trying to do the latter. `aggregate()` or `by()` return vectors of shorter length, but I want the sum by id listed for each observation. – ilprincipe Nov 22 '11 at 12:33
  • then `ave` is what you want... see my answer – John Nov 22 '11 at 13:01
  • I think it's somewhat unfair to compare plyr to other aggregation solutions - the aim of plyr is not to make computation fast, but to make thinking about aggregations faster and easier. The main advantage of plyr is that it has a uniform syntax for just about every type of manipulation you can think of, so you don't have to learn many special cases. Hopefully a future version will also be fast. – hadley Nov 22 '11 at 15:41
  • I agree. But I think in this relatively simple case the solution I worked up with plyr is also more complicated, whereas John's solution is perfectly simple. Admittedly, the syntax of Paul's solution is also simple, but it felt somewhat strange to me to pass the whole data frame to ddply when I had a strict idea which variables should be processed. – ilprincipe Nov 22 '11 at 17:39

6 Answers

12

Since you are quite new to R and speed is apparently an issue for you, I recommend the data.table package, which is really fast. One way to solve your problem in one line is as follows:

library(data.table)
DT <- data.table(ID = rep(c(1:3), each=3),
                 cand.perc = 1:9,
                 key="ID")
DT[ , perc.total := sum(cand.perc), by = ID]
DT
      ID cand.perc perc.total
 [1,]  1         1          6
 [2,]  1         2          6
 [3,]  1         3          6
 [4,]  2         4         15
 [5,]  2         5         15
 [6,]  2         6         15
 [7,]  3         7         24
 [8,]  3         8         24
 [9,]  3         9         24

Disclaimer: I'm not a data.table expert (yet ;-), so there might be faster ways to do this. Check out the package site to get you started if you are interested in using the package: http://datatable.r-forge.r-project.org/
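For reference (my restatement; this form comes from the question's benchmark, not from the answer as written), the keyed-join equivalent aggregates per ID and then joins the result back onto DT. It relies on the key set via key="ID" above, which is also what the vector-scan error discussed in the comments below is about:

# Aggregate per ID, then join the result back onto the keyed table
DT[ , list(perc.total = sum(cand.perc)), by = "ID"][DT]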

Christoph_J
  • This is interesting. Speed is not so much an issue yet, but the plyr versions seemed incredibly slow. However, applying your solution gives me the following error: `Error in \`[.data.table\`(df[, list(perc.total = sum(cand.perc)), by = "id"], : When i is a data.table, x must be sorted to avoid a vector scan of x per row of i` – ilprincipe Nov 22 '11 at 13:30
  • @ilprincipe Well, even if speed isn't such an issue, I would still recommend that you have a look at `data.table`: while the syntax was a little bit confusing for me at the beginning (I was used to `data.frames`), I now think it's elegant and easy to understand. I almost completely replaced `data.frames` with `data.tables` (and a `data.table` is just a `data.frame`, so you can use all the base R functions as well on a `data.table`). And since you are just starting, you might find it even easier to work with from the beginning. – Christoph_J Nov 22 '11 at 13:34
  • @Christoph_J I didn't think of doing it like that. Very nice. There might be faster ways if `DT` has very many columns; e.g., calculate the aggregate, `rep()` by `.N`, then add that column by reference using `:=`. Or just wait until `:=` by group is implemented. – Matt Dowle Nov 22 '11 at 13:35
  • @ilprincipe Re the error, you need to set a key on `df`. Like Christoph did it with `key=` or see `?setkey`. – Matt Dowle Nov 22 '11 at 13:38
  • @MatthewDowle Thanks for letting me know that this is indeed a good way of doing it. I'm getting there ;-) Anyway, I'm looking forward to the `:=` by group feature being implemented. This will make it even easier. – Christoph_J Nov 22 '11 at 13:40
  • @Christoph_J If you know C you're very welcome to help implement :-) Btw, I would have suggested `j=cbind(.SD,agg=sum(...))` but yours should be faster because it avoids `.SD`. – Matt Dowle Nov 22 '11 at 13:51
  • Thanks, it works now. I'll have a look at data.table. Your solution is slightly faster (I have added it to the OP), but I have to accept John's answer as it uses base functions. – ilprincipe Nov 22 '11 at 13:52
  • @ilprincipe Ok, good. You should find that as your dataset _scales_ into millions or billions of rows, the speed difference will become more apparent. Or if you need to return more than one aggregate per group, or combine together different columns in the aggregate, things like that are a little easier to write and read. – Matt Dowle Nov 22 '11 at 14:06
  • @ilprincipe The error message you showed in comments above has been improved in data.table v1.7.3 (now on CRAN) and you're thanked in NEWS. – Matt Dowle Nov 25 '11 at 14:14
6

For any kind of aggregation where you want a result vector of the same length as the input, with the group statistic repeated across each group's replicates, `ave` is what you want.

df$perc.total <- ave(df$cand.perc, df$id, FUN = sum)
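Applied to a toy data frame like the one sketched in the question (three observations per id, hypothetical values), this yields:

# Toy data as in the question
df <- data.frame(id = rep(1:3, each = 3), cand.perc = 1:9)
df$perc.total <- ave(df$cand.perc, df$id, FUN = sum)
head(df, 4)
#   id cand.perc perc.total
# 1  1         1          6
# 2  1         2          6
# 3  1         3          6
# 4  2         4         15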
John
  • Yup, this was exactly what I wanted (I had also looked at `colSums` etc.), but I wasn't aware that `ave` could take functions other than `mean`. This is also the fastest solution. I'll add a speed comparison to my post. – ilprincipe Nov 22 '11 at 13:25
  • What is the difference between `ave(df$cand.perc, df$id, FUN = sum)` (the solution you posted first) and `with(df, ave(cand.perc, id, FUN = sum))`? The effect is exactly the same if I'm not mistaken. Why should `with()` be used? – ilprincipe Nov 22 '11 at 13:37
  • `with` is handy when you don't want to keep writing the name of the data.frame. When the name is short and it's only twice I don't bother with it. It has other features as well, but in this case the results would be identical... I've rolled it back to the answer I wrote. – John Nov 22 '11 at 14:03
3

Use tapply to get the group stats, then add them back into your dataset afterwards.

Reproducible example:

means_by_wool <- with(warpbreaks, tapply(breaks, wool, mean))
warpbreaks$means.by.wool <- means_by_wool[warpbreaks$wool]

Untested solution for your scenario:

sum_by_id <- with(df, tapply(cand.perc, id, sum))
df$perc.total <- sum_by_id[df$id]
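One caveat worth adding here (an editorial note, not part of the original answer): `tapply` returns a vector named by the group levels, so when `id` is numeric, `sum_by_id[df$id]` subscripts by position rather than by name. That happens to coincide when the ids are exactly 1, 2, ..., k, but for arbitrary ids it is safer to index by character:

# Index by name so non-contiguous numeric ids still match their group sums
df$perc.total <- sum_by_id[as.character(df$id)]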
Richie Cotton
0

Why are you using `cbind(x, ...)`? The output of `ddply` will be appended automatically. This should work:

ddply(df, "id", transform, perc.total = sum(cand.perc))

Getting rid of the superfluous `cbind` should speed things up.
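For contrast (an illustrative sketch; `summarise` is the variant discussed in the comments below): `summarise` collapses to one row per group, which is the shorter result described there, while `transform` keeps every row:

# summarise returns one row per id (shorter than df);
# transform returns all rows of df with perc.total appended
ddply(df, "id", summarise, perc.total = sum(cand.perc))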

Paul Hiemstra
  • This does not append to the original dataset but returns a shorter dataset than the original one (unique in id) with only id and cand.perc as variables. – ilprincipe Nov 22 '11 at 12:39
  • Try `transform` instead of `summarize`. – Chase Nov 22 '11 at 12:43
  • Works, but just as slow. – ilprincipe Nov 22 '11 at 13:07
  • @ilprincipe - if speed is a main concern, heed the advice above re: data.table, or look into setting up an appropriate backend and adding the `parallel = TRUE` parameter to plyr. Providing some more details about the size and scope of your problem would let people give better answers, though I have to imagine one of the five above should suffice for almost any question. – Chase Nov 22 '11 at 13:26
  • For tips on parallel processing with plyr, see the following post on my blog: http://www.numbertheory.nl/2011/11/14/parallelization-using-plyr-loading-objects-and-packages-into-worker-nodes/ – Paul Hiemstra Nov 22 '11 at 21:56
  • Edited my answer to include transform – Paul Hiemstra Nov 22 '11 at 21:59
0

@ilprincipe, if none of the above fits your needs, you could try transposing your data:

dft <- t(df)

then use aggregate:

dfta <- aggregate(dft, by = list(rownames(dft)), FUN = sum)

then restore your row names:

rownames(dfta) <- dfta[, 1]
dfta <- dfta[, 2:ncol(dfta)]

transpose back to the original orientation:

df2 <- t(dfta)

and bind to the original data:

newdf <- cbind(df, df2)
boczniak767
  • Actually, one reason why I am ditching Stata is just this. Having to transpose your data to do tasks like this is tedious and unnecessary. – ilprincipe Nov 22 '11 at 13:06
0

You can also load up your favorite `foreach` backend and try the `.parallel = TRUE` argument for `ddply`.
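A minimal sketch, assuming the doParallel backend (the backend choice here is illustrative; any registered foreach backend works):

library(plyr)
library(doParallel)

# Register a foreach backend; plyr dispatches the per-group work
# to it when .parallel = TRUE
registerDoParallel(cores = 2)

df <- ddply(df, "id", transform, perc.total = sum(cand.perc),
            .parallel = TRUE)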

Zach