Sum a variable by group & create new column with frequency

Question

I have 2 columns of data. The first one is an id and the second one a value. There may be many occurrences of the same id. I need to aggregate the data by summing all the values for the same id AND I would like to create a new column with the number of occurrences of the same id.

For example:

I know I can use aggregate to sum the values and reduce the table to 4 rows, but I would like an extra column with the number of occurrences of the id like this:

id   value   freq
1     29      3
2      5      1
3     19      2
4     16      1

Thank you

akrun · Answer 1 · 2016-06-14T02:07:24.983

4

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'id', get the sum of 'value' and also the number of rows with (.N)

library(data.table)
setDT(df1)[, .(value=sum(value), freq = .N) , by = id]
#    id value freq
#1:  1    29    3
#2:  2     5    1
#3:  3    19    2
#4:  4    16    1

Or as @Frank commented

dcast(setDT(df1), id ~ ., fun = list(sum, length))

Or a similar approach with dplyr

library(dplyr)
df1 %>%
   group_by(id) %>%
   summarise(value = sum(value), freq = n())

edited Jun 14 '16 at 02:07

answered Jun 13 '16 at 15:41

akrun

874,273
37
540
662

1

I guess another data.table way is `dcast(DF, id ~ ., fun = list(sum, length))` – Frank Jun 13 '16 at 16:04

RHertel · Accepted Answer · 2016-06-14T11:02:00.493

3

Using base R, one can can combine aggregate() and table() like this:

cbind(aggregate(value ~ id, df1, sum), freq=as.vector(table(df1$id)))
#  id value freq
#1  1    29    3
#2  2     5    1
#3  3    19    2
#4  4    16    1

data used in this example:

df1 <- structure(list(id = c(1L, 1L, 2L, 3L, 1L, 3L, 4L), 
                 value = c(15L, 10L, 5L, 7L, 4L, 12L, 16L)), 
                 .Names = c("id", "value"), class = "data.frame", 
                 row.names = c(NA, -7L))

edited Jun 14 '16 at 11:02

answered Jun 13 '16 at 15:42

RHertel

23,412
5
38
64

2

You could also do `do.call(data.frame, aggregate(.~id, df1, function(x) c(sum(x), length(x))))` – akrun Jun 13 '16 at 15:46
1

Nice alternative, @akrun. Thank you. But I think in that case in addition I'd have to worry about the column names. – RHertel Jun 13 '16 at 15:48
Yes, you are right about that. – akrun Jun 13 '16 at 15:50
@akrun The `do.call` is not needed imo. Just using `aggregate(.~id, df1, function(x) c(sum(x), length(x)))` will give the same result. – Jaap Jun 13 '16 at 15:59
4

@ProcrastinatusMaximus, that leaves you with a messy structure (look at the str) – talat Jun 13 '16 at 16:03
1

thank you very much – Malagasy Desi Jun 13 '16 at 18:13
@DavidArenburg, that works for the example but it fails if `id` is, for example, a `character` and besides it creates a matrix which may result in undesired type coercion as you know – talat Jun 13 '16 at 20:23
I tried "df2<-aggregate(.~id, df1, function(x) c(sum(x), length(x)))" I get a dataframe with 2 columns only: id and the sum. If I type: df2<-aggregate(.~id, df1, function(x) c(length(x),sum(x))) then I get the length in column 2. What am I doing wrong? – Malagasy Desi Jun 14 '16 at 08:14
@MalagasyDesi Unfortunately, this post has been spoiled by many comments from other users. In my opinion, not all of the comments were helpful, and I found that, all things considered, the chat in its length was inappropriate and confusing. I'm sorry that you seem to have gotten confused by this, too. Please try the solution **that I have posted** and tell me if everything works. It works in any case for the example you posted. – RHertel Jun 14 '16 at 11:56
Just to avoid misunderstandings: The solution in the answer by @akrun is equally fine as far as I can tell (and presumably some of the answers in the comments, too). I did not want to insinuate that mine was better. But to avoid further confusion I prefer not to comment on various different answers that other users posted below my answer. These users could have posted their own answers separately, where such questions could have been addressed more easily – RHertel Jun 14 '16 at 11:56
@RHertel I am actually using your solution which I like. I just tried the other one out of curiosity but I do not seem to get the expected result, so I was wondering why – Malagasy Desi Jun 14 '16 at 12:51
@MalagasyDesi I understand, and I'm glad to hear that my answer works for you. Sorry for the misunderstanding. The other solution that you mentioned works in my case. Although the data structure is quite messy and the column names are wrong, the content of the output is correct. A common problem in similar situations is the class of the data. You may want to check the output of `sapply(df1,class)` and see if there is any irregularity. Apart from that, I'm afraid that I can't help. – RHertel Jun 14 '16 at 13:00
*sigh* Are you for real flagged my comment which offered an alternative solution? For real? Ok, I won't comment under your answers anymore. You could just ping me directly. This is pathetic. And this supercilious language of "*this post has been spoiled by many comments from other users*"- pathetic. – David Arenburg Jun 14 '16 at 17:17

Sum a variable by group & create new column with frequency

2 Answers2