Calculate column sums for each combination of two grouping variables

Question

I have a dataset that looks something like this:

 Type Age   count1  count2  Year   Pop1   Pop2  TypeDescrip
  A   35    1        1      1990   30000  50000  alpha                                 
  A   35    3        1      1990   30000  50000  alpha 
  A   45    2        3      1990   20000  70000  alpha 
  B   45    2        1      1990   20000  70000  beta
  B   45    4        5      1990   20000  70000  beta

I want to add the counts of the rows that are matching in the Type and Age columns. So ideally I would end up with a dataset that looks like this:

 Type  Age  count1  count2  Year   Pop1   Pop2  TypeDescrip 
  A   35    4        2      1990   30000  50000  alpha 
  A   45    2        3      1990   20000  70000  alpha 
  B   45    6        6      1990   20000  70000  beta

I've tried using nested duplicated() statements such as below:

typedup = duplicated(df$Type)
bothdup = duplicated(df[(typedup == TRUE),]$Age)

but this returns indices for which age or type are duplicated, not necessarily when one row has duplicates of both.

I've also tried tapply:

tapply(c(df$count1, df$count2), c(df$Age, df$Type), sum)

but this output is difficult to work with. I want to have a data.frame when I'm done.

I don't want to use a for-loop because my dataset is quite large.

If you have many columns to group by and sum, see http://stackoverflow.com/questions/30669817/r-aggregate-by-large-number-of-columns/30670135 — Sam Firke, Jul 02 '15 at 18:00
@Frank I thought there _must_ be a duplicate of this question - but I didn't find any perfect matches. This question has two grouping variables which makes it different from the one you linked. — Sam Firke, Jul 02 '15 at 18:06
@SamFirke Not different enough for my tastes :) The important difference, anyway, is not the two grouping variables, but the two columns being summed. If there were only one, the OP's `tapply` would almost "work" (in the sense of at least giving the right numbers, though not in a data.frame). — Frank, Jul 02 '15 at 18:08
I think your TypeDescrip would be beta for the 2nd row in the expected output. Try `df2 %>% group_by(Type, Age,Pop1, Pop2, TypeDescrip) %>% summarise_each(funs(sum), matches('^count'))` — akrun, Jul 02 '15 at 20:04

akrun · Accepted Answer · 2020-03-10T18:06:14.977

Try

library(dplyr)
df1 %>%
     group_by(Type, Age) %>% 
     summarise_each(funs(sum))
#    Type Age count1 count2
#1    A  35      4      2
#2    A  45      2      3
#3    B  45      6      6

In the newer versions of dplyr

df1 %>%
     group_by(Type, Age) %>%
     summarise_all(sum)

Or using base R

 aggregate(.~Type+Age, df1, FUN=sum)
 #    Type Age count1 count2
 #1    A  35      4      2
 #2    A  45      2      3
 #3    B  45      6      6

Or

library(data.table)
setDT(df1)[, lapply(.SD, sum), .(Type, Age)] 
#   Type Age count1 count2
#1:    A  35      4      2
#2:    A  45      2      3
#3:    B  45      6      6

Update

Based on the new dataset,

 df2 %>%
     group_by(Type, Age,Pop1, Pop2, TypeDescrip) %>% 
     summarise_each(funs(sum), matches('^count'))
 #    Type Age  Pop1  Pop2 TypeDescrip count1 count2
 #1    A  35 30000 50000       alpha      4      2
 #2    A  45 20000 70000        beta      2      3
 #3    B  45 20000 70000        beta      6      6

data

 df1 <- structure(list(Type = c("A", "A", "A", "B", "B"), Age = c(35L, 
 35L, 45L, 45L, 45L), count1 = c(1L, 3L, 2L, 2L, 4L), count2 = c(1L, 
 1L, 3L, 1L, 5L)), .Names = c("Type", "Age", "count1", "count2"
 ), class = "data.frame", row.names = c(NA, -5L))

 df2 <- structure(list(Type = c("A", "A", "A", "B", "B"), Age = c(35L, 
 35L, 45L, 45L, 45L), count1 = c(1L, 3L, 2L, 2L, 4L), count2 = c(1L, 
 1L, 3L, 1L, 5L), Year = c(1990L, 1990L, 1990L, 1990L, 1990L), 
   Pop1 = c(30000L, 30000L, 20000L, 20000L, 20000L), Pop2 = c(50000L, 
   50000L, 70000L, 70000L, 70000L), TypeDescrip = c("alpha", 
   "alpha", "beta", "beta", "beta")), .Names = c("Type", "Age", 
  "count1", "count2", "Year", "Pop1", "Pop2", "TypeDescrip"),
   class =   "data.frame", row.names = c(NA, -5L))

I like the group_by solution you provided, but is there a way to include more columns in the output? my dataset is wider than the example I gave in my original post. — heo, Jul 02 '15 at 19:48
@Hannah Can you update your post with an example that mimics your original data? I guess you have columns other than the one you wanted to get the sum. But, if you you want to keep those columns in the summary, which values would you select. i.e. the last one, first one etc.. — akrun, Jul 02 '15 at 19:51
In the first code snipped using `funs` is deprecated., use `list` instead — Octopus, Mar 10 '20 at 10:09

score 1 · Answer 2 · edited Jul 02 '15 at 17:51

1

@hannah you can also use sql using the sqldf package

sqldf("select 
Type,Age,
sum(count1) as sum_count1, 
sum(count2) as sum_count2 
from 
 df 
group by 
Type,Age
")

edited Jul 02 '15 at 17:51

akrun

874,273
37
540
662

answered Jul 02 '15 at 17:40

Ajay Ohri

3,382
3
30
60

Calculate column sums for each combination of two grouping variables

2 Answers2

Update

data