How to aggregate duplicate rows with multiple columns in data frame

Question

I have a data.frame that looks like this (but with many more columns and rows):

    Gene      Cell1    Cell2    Cell3     
1      A          2        7        8 
2      A          5        2        9 
3      B          2        7        8
4      C          1        4        3

I want to sum the rows that have the same value in Gene, in order to get something like this:

    Gene      Cell1    Cell2    Cell3     
1      A          7        9       17  
2      B          2        7        8
3      C          1        4        3

Based on the answers to previous questions, I've tried to use aggregate but I could not understand how I can get the above result. This is what I've tried:

aggregate(df[,-1], list(df[,1]), FUN = sum)

Does anyone have an idea of what I'm doing wrong?

What's wrong with the result you've got with aggregate? – Bea May 28 '17 at 17:56

score 6 · Accepted Answer · answered May 28 '17 at 18:08

aggregate(df[,-1], list(Gene=df[,1]), FUN = sum)
#   Gene Cell1 Cell2 Cell3
# 1    A     7     9    17
# 2    B     2     7     8
# 3    C     1     4     3

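will give you the output you are looking for. The only change from your attempt is that the grouping list element is named (`Gene=`); without the name, the sums are the same but the grouping column is labelled `Group.1`.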
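For anyone who wants to run this end to end, here is a minimal sketch. The data frame is rebuilt from the example in the question, and the formula call at the end is an equivalent alternative that is not part of the answer above; the object names `unnamed`, `by_list` and `by_formula` are purely illustrative.

```r
# Rebuild the example data from the question
df <- data.frame(
  Gene  = c("A", "A", "B", "C"),
  Cell1 = c(2, 5, 2, 1),
  Cell2 = c(7, 2, 7, 4),
  Cell3 = c(8, 9, 8, 3)
)

# Unnamed grouping list, as in the question: the key column is "Group.1"
unnamed <- aggregate(df[, -1], list(df[, 1]), FUN = sum)
names(unnamed)[1]

# Named grouping list, as in the answer: the key column is "Gene"
by_list <- aggregate(df[, -1], list(Gene = df[, 1]), FUN = sum)

# Formula interface: "." stands for every column other than Gene
by_formula <- aggregate(. ~ Gene, data = df, FUN = sum)
```

One difference to keep in mind: the formula interface drops rows containing `NA` by default (`na.action = na.omit`), so the two calls can disagree on data with missing values.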

lukeA

There's an error, when we run the above: `Error in aggregate.data.frame(df[, -1], list(Gene = df[, 1]), FUN = sum) : arguments must have same length` – Manoj Kumar May 28 '17 at 18:19
@ManojKumar Please add the output of `str(df)` to your post. – lukeA May 28 '17 at 18:23
Sure @lukeA here it is : `Classes ‘data.table’ and 'data.frame': 4 obs. of 4 variables: $ Gene : chr "A" "A" "B" "C" $ Cell1: int 2 5 2 1 $ Cell2: int 7 2 7 4 $ Cell3: int 8 9 8 3 - attr(*, ".internal.selfref")= ` – Manoj Kumar May 28 '17 at 18:26
@ManojKumar thx. You got a data table object; indexing is a bit different there. So you could e.g. do `aggregate(df[,-1], list(Gene=df[[1]]), FUN = sum)`. But if you got a data table anyway, you may want to use its strengths in aggregating data; `df[, lapply(.SD, sum), by=Gene]`. – lukeA May 28 '17 at 18:39
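A small sketch of why the error in the comments appears and how the two fixes behave. It assumes the data.table package is installed (a version where `dt[, 1]` selects a column, i.e. 1.9.8 or later); the object is called `dt` here only to distinguish it from a plain data.frame, and `res_base` / `res_dt` are illustrative names.

```r
library(data.table)

dt <- data.table(
  Gene  = c("A", "A", "B", "C"),
  Cell1 = c(2L, 5L, 2L, 1L),
  Cell2 = c(7L, 2L, 7L, 4L),
  Cell3 = c(8L, 9L, 8L, 3L)
)

# dt[, 1] is still a one-column data.table, so its length is 1 (one column),
# not nrow(dt); aggregate() therefore reports "arguments must have same length"
class(dt[, 1])
length(dt[, 1])

# dt[[1]] extracts the column as a plain vector, which aggregate() accepts
res_base <- aggregate(dt[, -1], list(Gene = dt[[1]]), FUN = sum)

# data.table's own grouping: .SD holds every non-grouping column
res_dt <- dt[, lapply(.SD, sum), by = Gene]
```

The data.table form avoids naming or indexing the value columns at all, which is convenient when there are many of them.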

score 4 · Answer 2 · edited May 28 '17 at 21:17

Or with dplyr:

library(dplyr)

# Sum every non-grouping column within each Gene
newdf <- df %>%
  group_by(Gene) %>%
  summarise_all(sum) %>%
  as.data.frame() # convert the tibble back to a plain data.frame for further use
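In dplyr 1.0.0 and later, `summarise_all()` is marked as superseded in favour of `across()`. A possible equivalent is sketched below, assuming dplyr 1.0.0 or newer is installed; the data frame is rebuilt from the question so the snippet runs on its own.

```r
library(dplyr)

# Rebuild the example data from the question
df <- data.frame(
  Gene  = c("A", "A", "B", "C"),
  Cell1 = c(2, 5, 2, 1),
  Cell2 = c(7, 2, 7, 4),
  Cell3 = c(8, 9, 8, 3)
)

# Inside a grouped summarise(), everything() covers all non-grouping columns
newdf <- df %>%
  group_by(Gene) %>%
  summarise(across(everything(), sum)) %>%
  as.data.frame()
```

If some columns are not numeric, `across(where(is.numeric), sum)` restricts the sum to the numeric ones.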

edited May 28 '17 at 21:17

Manoj Kumar

answered May 28 '17 at 18:21

jay.sf

the other methods work but this is more robust as well as intuitive. I like that one does not need to declare what columns to sum. – Ahdee May 26 '18 at 14:30
