4

I have a df that basically looks like this:

Id  A    B    C total
3   5    0    1  6
3   4    3    4   11
3   2    1    2    5
4   5    4    3   12
4   3    2    4    9
4   1    1    1    3

I want to collapse the rows by Id and get:

Id   A    B    C    total
3    11   4    7     22
4    9    7    8   24

I was able to do so for one column with:

df.grouped<- aggregate(df$A~Id, data=df, FUN="sum")

I have many columns (A-Z), so I need some kind of loop. I tried:

df.grouped<- aggregate(df[4:51]~Id, data=df, FUN="sum")
names(df.grouped)<-paste(names(df)[4:51])

But got:

Error in model.frame.default(formula = df[4:51] ~ Id, data = df) : 
invalid type (list) for variable 'df[4:51]'

As you can see, I also want the names in df.grouped to be the same as in df.

Any ideas will be very helpful

Thanks

user3315563
  • 495
  • 2
  • 5
  • 10
  • Thanks! the first solution works very well. – user3315563 Oct 05 '15 at 06:36
  • 1
    If the answer worked for you, it would be appreciated if you accept the answer. This will give future readers a clue about the value of the solution. See also this help page: [What should I do when someone answers my question?](http://stackoverflow.com/help/someone-answers) – Jaap Oct 10 '15 at 11:09
  • @user3315563 To which answer are you referring? Why don't you follow Jaap's above advice? This is the best way to reward people who take time to answer to your questions! – Gilbert Apr 06 '18 at 07:11
  • The first one, with aggregate – user3315563 Apr 07 '18 at 07:30

3 Answers3

6

We can use the formula method of aggregate. By specifying . on the LHS of ~, we select all the columns except the 'Id' column.

aggregate(.~Id, df, sum)
#   Id  A B C total
#1  3 11 4 7    22
#2  4  9 7 8    24

Or we can also specify the columns without using the formula method

aggregate(df[2:ncol(df)],df['Id'], FUN=sum)
#  Id  A B C total
#1  3 11 4 7    22
#2  4  9 7 8    24

Other options include dplyr and data.table.

Using dplyr, we group by 'Id' and get the sum of all columns with summarise_each.

library(dplyr)
df %>%
  group_by(Id) %>%
  summarise_each(funs(sum))

Or with data.table, we convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'Id', we loop (lapply(..) through the Subset of Data.table (.SD) and get the sum.

library(data.table)
setDT(df)[, lapply(.SD, sum), by = Id]
akrun
  • 874,273
  • 37
  • 540
  • 662
1

Use summaryBy from the doBy library:

library(doBy)
df.grouped = summaryBy(A + B + C + total ~ Id, data = df, FUN = sum,
    keep.names = TRUE)

You may omit FUN = sum as it is the default. When calculating length (count), mean, or variance, you can replace is with FUN = length or FUN = mean and so on.

Since you are also selecting every column of the data frame, you may use . ~ Id as a shorthand form.


An example that works right out of the box with R:

head(mtcars) library(doBy) summaryBy(. ~ cyl, data = mtcars, FUN = mean, keep.names = TRUE)

cyl      mpg     disp        hp     drat       wt     qsec        vs        am     gear     carb
1   4 26.66364 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
2   6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
3   8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000
Teng L
  • 297
  • 3
  • 17
1

Create df

        df <- data.frame(Id = c(3,3,3,4,4,4), 
                         A = c(5,4,2,5,3,1),   
                         B = c(0,3,1,4,2,1),   
                         C = c(1,4,2,3,4,1))

library(plyr)

req <- ddply( df, .(Id),
           summarise,
           A  = sum(A),
           B  = sum(B),
           C  = sum(C))

View(req)

Chaitu
  • 151
  • 2
  • 9