6

I have a dataset that looks something like this:

 Type Age   count1  count2  Year   Pop1   Pop2  TypeDescrip
  A   35    1        1      1990   30000  50000  alpha                                 
  A   35    3        1      1990   30000  50000  alpha 
  A   45    2        3      1990   20000  70000  alpha 
  B   45    2        1      1990   20000  70000  beta
  B   45    4        5      1990   20000  70000  beta 

I want to add the counts of the rows that are matching in the Type and Age columns. So ideally I would end up with a dataset that looks like this:

 Type  Age  count1  count2  Year   Pop1   Pop2  TypeDescrip 
  A   35    4        2      1990   30000  50000  alpha 
  A   45    2        3      1990   20000  70000  alpha 
  B   45    6        6      1990   20000  70000  beta 

I've tried using nested duplicated() statements such as below:

typedup = duplicated(df$Type)
bothdup = duplicated(df[(typedup == TRUE),]$Age)

but this returns indices for which age or type are duplicated, not necessarily when one row has duplicates of both.

I've also tried tapply:

tapply(c(df$count1, df$count2), c(df$Age, df$Type), sum)

but this output is difficult to work with. I want to have a data.frame when I'm done.

I don't want to use a for-loop because my dataset is quite large.

heo
  • 141
  • 5
  • 1
    If you have many columns to group by and sum, see http://stackoverflow.com/questions/30669817/r-aggregate-by-large-number-of-columns/30670135 – Sam Firke Jul 02 '15 at 18:00
  • @Frank I thought there _must_ be a duplicate of this question - but I didn't find any perfect matches. This question has two grouping variables which makes it different from the one you linked. – Sam Firke Jul 02 '15 at 18:06
  • @SamFirke Not different enough for my tastes :) The important difference, anyway, is not the two grouping variables, but the two columns being summed. If there were only one, the OP's `tapply` would almost "work" (in the sense of at least giving the right numbers, though not in a data.frame). – Frank Jul 02 '15 at 18:08
  • I think your TypeDescrip would be beta for the 2nd row in the expected output. Try `df2 %>% group_by(Type, Age,Pop1, Pop2, TypeDescrip) %>% summarise_each(funs(sum), matches('^count'))` – akrun Jul 02 '15 at 20:04

2 Answers2

9

Try

library(dplyr)
df1 %>%
     group_by(Type, Age) %>% 
     summarise_each(funs(sum))
#    Type Age count1 count2
#1    A  35      4      2
#2    A  45      2      3
#3    B  45      6      6

In the newer versions of dplyr

df1 %>%
     group_by(Type, Age) %>%
     summarise_all(sum)

Or using base R

 aggregate(.~Type+Age, df1, FUN=sum)
 #    Type Age count1 count2
 #1    A  35      4      2
 #2    A  45      2      3
 #3    B  45      6      6

Or

library(data.table)
setDT(df1)[, lapply(.SD, sum), .(Type, Age)] 
#   Type Age count1 count2
#1:    A  35      4      2
#2:    A  45      2      3
#3:    B  45      6      6

Update

Based on the new dataset,

 df2 %>%
     group_by(Type, Age,Pop1, Pop2, TypeDescrip) %>% 
     summarise_each(funs(sum), matches('^count'))
 #    Type Age  Pop1  Pop2 TypeDescrip count1 count2
 #1    A  35 30000 50000       alpha      4      2
 #2    A  45 20000 70000        beta      2      3
 #3    B  45 20000 70000        beta      6      6

data

 df1 <- structure(list(Type = c("A", "A", "A", "B", "B"), Age = c(35L, 
 35L, 45L, 45L, 45L), count1 = c(1L, 3L, 2L, 2L, 4L), count2 = c(1L, 
 1L, 3L, 1L, 5L)), .Names = c("Type", "Age", "count1", "count2"
 ), class = "data.frame", row.names = c(NA, -5L))

 df2 <- structure(list(Type = c("A", "A", "A", "B", "B"), Age = c(35L, 
 35L, 45L, 45L, 45L), count1 = c(1L, 3L, 2L, 2L, 4L), count2 = c(1L, 
 1L, 3L, 1L, 5L), Year = c(1990L, 1990L, 1990L, 1990L, 1990L), 
   Pop1 = c(30000L, 30000L, 20000L, 20000L, 20000L), Pop2 = c(50000L, 
   50000L, 70000L, 70000L, 70000L), TypeDescrip = c("alpha", 
   "alpha", "beta", "beta", "beta")), .Names = c("Type", "Age", 
  "count1", "count2", "Year", "Pop1", "Pop2", "TypeDescrip"),
   class =   "data.frame", row.names = c(NA, -5L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    I like the group_by solution you provided, but is there a way to include more columns in the output? my dataset is wider than the example I gave in my original post. – heo Jul 02 '15 at 19:48
  • 1
    @Hannah Can you update your post with an example that mimics your original data? I guess you have columns other than the one you wanted to get the sum. But, if you you want to keep those columns in the summary, which values would you select. i.e. the last one, first one etc.. – akrun Jul 02 '15 at 19:51
  • 1
    @Hannah Updated the post with the new data – akrun Jul 02 '15 at 20:06
  • 1
    In the first code snipped using `funs` is deprecated., use `list` instead – Octopus Mar 10 '20 at 10:09
1

@hannah you can also use sql using the sqldf package

sqldf("select 
Type,Age,
sum(count1) as sum_count1, 
sum(count2) as sum_count2 
from 
 df 
group by 
Type,Age
")
akrun
  • 874,273
  • 37
  • 540
  • 662
Ajay Ohri
  • 3,382
  • 3
  • 30
  • 60