Collapse data frame, adding values for factor

Question

Possible Duplicate:
Aggregate R sum

I have a data frame that looks like this:

  sample value
1      a     1
2      a     2
3      b     3
4      b     4
5      b     5
6      c     6

I need to collapse the data frame over the levels of sample (may be a character vector or a factor), adding all the values, so my resulting data frame looks like this:

  sample sum
1      a   3
2      b  12
3      c   6

It's Monday morning, and all I can think of is writing a complicated for loop. How might I vectorize this using apply, plyr, etc?

Roland · Accepted Answer · 2012-11-26T15:03:51.403

score 7

If you don't want to load a package:

df <- read.table(text="  sample value
1      a     1
2      a     2
3      b     3
4      b     4
5      b     5
6      c     6",header=TRUE)

aggregate(df$value,by=list(sample=df$sample),sum)

  sample  x
1      a  3
2      b 12
3      c  6

Or if you prefer formula syntax:

aggregate(value ~ sample, df, sum)

  sample value
1      a     3
2      b    12
3      c     6
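The question asked for the aggregated column to be called sum rather than value. A minimal sketch of one way to get that layout, assuming the df from the question; the name res is only illustrative:

```r
# Example data from the question
df <- read.table(text = "sample value
a 1
a 2
b 3
b 4
b 5
c 6", header = TRUE)

# Sum value within each level of sample
res <- aggregate(value ~ sample, data = df, FUN = sum)

# Rename the aggregated column to match the layout asked for in the question
names(res)[names(res) == "value"] <- "sum"
res
```

Renaming after the fact keeps the aggregate call itself unchanged, so either of the two forms above can be used.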

edited Nov 26 '12 at 15:03

answered Nov 26 '12 at 14:54

Roland


score 4 · Answer 2 · edited Nov 26 '12 at 15:27

I like cast for these types of problems because it's quick and intuitive:

library(reshape2)
dcast(your_df, sample ~ ., sum) # or just cast with the original reshape

I also like summarise for this type of question:

library(plyr) 
ddply(df,.(sample),summarise, sum=sum(value))

edited Nov 26 '12 at 15:27

Community


answered Nov 26 '12 at 14:52

Brandon Bertelsen


Thanks Brandon. Upvoted because this works on my simple example, but when my data frame has many other columns, the code above returns an error "undefined columns selected". – Stephen Turner Nov 26 '12 at 14:59
Hard to diagnose without seeing at the very least the str() of your data. But you could always subset and cast `dcast(your_df[1:2], sample ~ .,sum)` – Brandon Bertelsen Nov 26 '12 at 15:03
Small error in the ddply/summarise code. Should be: `ddply(df,.(sample),summarise, sum=sum(value))` which might fix the error you are seeing. – JAShapiro Nov 26 '12 at 15:26
Woah, "Community" just updated that automatically based on your comment. Pretty cool beans. – Brandon Bertelsen Nov 26 '12 at 15:27
Actually, that was me before I logged in. But when I logged in, the edit was not showing up, so I commented... I'm not sure I like that an unlogged user can update an answer like that... did you have to approve it? – JAShapiro Nov 26 '12 at 15:30
I would have been really impressed if that was an automatic edit! – Brandon Bertelsen Nov 26 '12 at 17:05
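For the case raised in the comments, where the data frame has many other columns, one possible base R sketch is to name the value column explicitly in a formula, so the remaining columns are never touched. The data frame wide and its extra columns batch and note are hypothetical, made up here for illustration; this is not necessarily what caused the reported error.

```r
# Hypothetical wide data frame: batch and note are made-up extra columns
wide <- data.frame(
  sample = c("a", "a", "b", "b", "b", "c"),
  value  = 1:6,
  batch  = c(1, 1, 2, 2, 3, 3),
  note   = c("x", "y", "x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# Only the variables named in the formula are used; batch and note are ignored
res_wide <- aggregate(value ~ sample, data = wide, FUN = sum)
res_wide
```

Subsetting to the two relevant columns first, as suggested in the comment above, should give the same result.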

score 4 · Answer 3 · answered Nov 26 '12 at 16:08

In the spirit of sharing, you can also use the sqldf and data.table packages quite easily:

Your data:

df <- read.table(text="  sample value
1      a     1
2      a     2
3      b     3
4      b     4
5      b     5
6      c     6",header=TRUE)

The sqldf alternative:

library(sqldf)
sqldf("select sample, sum(value) `value` from df group by sample")
#   sample value
# 1      a     3
# 2      b    12
# 3      c     6

The data.table alternative:

library(data.table)
DT <- data.table(df, key="sample")
DT[, list(value = sum(value)), by=key(DT)]
#    sample value
# 1:      a     3
# 2:      b    12
# 3:      c     6

+1 for sqldf, very useful if you already know SQL. – Brandon Bertelsen Nov 26 '12 at 20:11

score 3 · Answer 4 · answered Nov 26 '12 at 15:08

The "classic" R command is tapply:

n <- 17
fac <- factor(rep(1:3, length.out = n), levels = 1:5)
df <- data.frame(target = 1:n, factor = fac)
# group by the data frame's own column rather than the global fac
with(df, tapply(target, factor, sum))

by and aggregate both work, but having the input and output as lists or by objects is daft. ddply from plyr will also work; it is somewhat less clear syntactically, although it comes into its own for more complicated examples.
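Applied to the data in the question, tapply returns a named one-dimensional array rather than a data frame, so one extra step is needed to get the requested layout. A minimal sketch, assuming the question's column names; sums and res_tapply are illustrative names:

```r
# Example data from the question
df <- data.frame(
  sample = c("a", "a", "b", "b", "b", "c"),
  value  = 1:6
)

# Named result: one sum per level of sample
sums <- tapply(df$value, df$sample, sum)

# Rebuild a data frame with the group labels as a column
res_tapply <- data.frame(sample = names(sums), sum = as.vector(sums))
res_tapply
```

If sample is a factor with unused levels, as in the example above with levels 4 and 5, tapply reports NA for those levels.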

score 1 · Answer 5 · answered Nov 26 '12 at 14:58

One solution using a base R function is:

aggregate(x = df$value, by = list(df$sample), FUN = sum)

You can also do it with ddply from the plyr package:

library(plyr)
ddply(df, .(sample), numcolwise(sum))

Here df is your data.frame.

answered Nov 26 '12 at 14:58

Jilber Urbina

