Merge and aggregate result to original data

Question

I have a dataframe df, with 3 variables : id, category and quantity :

id category quantity
01 AB       235
02 BC       987
03 AB       366
04 CD       287

I want to add a fourth variable which is the sum of the whole category. For now, I do that like it :

df <- merge(df,aggregate(df$quantity,list(df$category),sum),
     by.x="category", 
     by.y="Group.1")
names(df)[4] <- "sum.category"

It works, but I don't find it very satisfying, there's probably a better way?

@Pascal : done, sorry for the inconvenience – Malta Jan 21 '16 at 06:07 — Malta, Jan 21 '16 at 06:07

score 4 · Accepted Answer · answered Jan 21 '16 at 06:11

Here is another option with data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'category', we assign (:=) the sum of 'quantity' as the new column ('sum.category').

library(data.table)
setDT(df1)[,sum.category:= sum(quantity) , category]
df1
#    id category quantity sum.category
#1:  1       AB      235          601
#2:  2       BC      987          987
#3:  3       AB      366          601
#4:  4       CD      287          287

Or using base R

df1$sum.category <- with(df1, ave(quantity, category, FUN=sum))

score 3 · Answer 2 · answered Jan 21 '16 at 05:16

You can use tapply to get the sums, then us a lookup table to create the new column

# use tapply to get the sums.  using with() makes the code nicer, IMO.
cat_sums <- with(df, tapply(quantity, category, sum))
# use lookup table to create new column
df$sum.category <- cat_sums[df$category]
#  id category quantity sum.category
#1  1       AB      235          601
#2  2       BC      987          987
#3  3       AB      366          601
#4  4       CD      287          287

score 3 · Answer 3 · edited Jan 21 '16 at 09:32

Here is a dplyr solution

df %>%
    group_by(category)                   %>% # Group by category
    mutate(sum.category = sum(quantity)) %>% # Sum by category
    ungroup                                  # Remove grouping
#Source: local data frame [4 x 4]
#
#     id category quantity sum.category
#  (int)    (chr)    (int)        (int)
#1     1       AB      235          601
#2     2       BC      987          987
#3     3       AB      366          601
#4     4       CD      287          287

The ungroup isn't absolutely necessary.

Bala · Answer 4 · 2016-01-21T09:16:42.977

You could use the same merge and aggregate in a little more readable way. Merging the actual df with result of aggregate on left outer join all.x = TRUE will do the job. I hope this is better to understand.

df <- data.frame(id=c(01,02,03,04),category=c("AB","BC","AB","CD"),
                 quantity=c(235,987,366,287))

df <- merge(df,aggregate(quantity ~ category, data=df, sum),"category",all.x = TRUE)
names(df)[4] <- "sum.category"
df

#   category  id     quantity.x   sum.category
#         AB  1        235        601
#         AB  3        366        601
#         BC  2        987        987
#         CD  4        287        287

If you still want a more easy-to-understand way, then sql is the best option. For this you might need sqldf library. We are doing the same aggregate and merge back to the actual df in an sql way. Its more like a self-join thing. And the sql codes are quite easier to understand

library (sqldf)
dfnew<-sqldf("select a.*,b.sum_quantity
             from df a left join 
             (select category, sum(quantity) sum_category
             from df group by 1) b
             on a.category=b.category")

dfnew

# category id quantity sum_category
#       AB  1      235          601
#       BC  2      987          987
#       AB  3      366          601
#       CD  4      287          287

Merge and aggregate result to original data

4 Answers4