
Possible Duplicate:
Aggregate R sum

I have a data frame that looks like this:

  sample value
1      a     1
2      a     2
3      b     3
4      b     4
5      b     5
6      c     6

I need to collapse the data frame over the levels of sample (may be a character vector or a factor), adding all the values, so my resulting data frame looks like this:

  sample sum
1      a   3
2      b  12
3      c   6

It's Monday morning, and all I can think of is writing a complicated for loop. How might I vectorize this using apply, plyr, etc?

Stephen Turner

5 Answers


If you don't want to load a package:

df <- read.table(text="  sample value
1      a     1
2      a     2
3      b     3
4      b     4
5      b     5
6      c     6",header=TRUE)

aggregate(df$value,by=list(sample=df$sample),sum)

  sample  x
1      a  3
2      b 12
3      c  6

Or if you prefer formula syntax:

aggregate(value ~ sample, df, sum)

  sample value
1      a     3
2      b    12
3      c     6
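
The question asks for a result column named sum rather than value or x. A minimal base R sketch of one way to get that, rebuilding the example data inline and renaming the aggregated column afterwards (the name `res` is just an illustrative choice):

```r
# Example data from the question
df <- data.frame(sample = c("a", "a", "b", "b", "b", "c"),
                 value  = 1:6)

# Aggregate with the formula interface, then rename the summed column
res <- aggregate(value ~ sample, data = df, FUN = sum)
names(res)[names(res) == "value"] <- "sum"

res
```

Renaming after the fact keeps the aggregate() call itself unchanged, which may be easier to read than reshaping the formula.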
Roland

I like cast for these types of problems because it's quick and intuitive:

library(reshape2)
dcast(your_df, sample ~ ., sum, value.var = "value") # or just cast with the original reshape

I also like summarise for this type of question:

library(plyr) 
ddply(df,.(sample),summarise, sum=sum(value))
Brandon Bertelsen
  • Thanks Brandon. Upvoted because this works on my simple example, but when my data frame has many other columns, the code above returns an error "undefined columns selected". – Stephen Turner Nov 26 '12 at 14:59
  • Hard to diagnose without seeing at the very least the str() of your data. But you could always subset and cast `dcast(your_df[1:2], sample ~ .,sum)` – Brandon Bertelsen Nov 26 '12 at 15:03
  • Small error in the ddply/summarise code. Should be: `ddply(df,.(sample),summarise, sum=sum(value))` which might fix the error you are seeing. – JAShapiro Nov 26 '12 at 15:26
  • Woah, "Community" just updated that automatically based on your comment. Pretty cool beans. – Brandon Bertelsen Nov 26 '12 at 15:27
  • Actually, that was me before I logged in. But when I logged in, the edit was not showing up, so I commented... I'm not sure I like that an unlogged user can update an answer like that... did you have to approve it? – JAShapiro Nov 26 '12 at 15:30
  • I would have been really impressed if that was an automatic edit! – Brandon Bertelsen Nov 26 '12 at 17:05
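
For the many-columns case raised in the comments, one option that may help is naming the column to aggregate explicitly through dcast's value.var argument, so no subsetting is needed. This is only a sketch: the extra columns `batch` and `weight` are hypothetical, and the actual cause of the "undefined columns selected" error cannot be confirmed without seeing the real data.

```r
library(reshape2)

# Example data with hypothetical extra columns (batch, weight)
wide_df <- data.frame(sample = c("a", "a", "b", "b", "b", "c"),
                      value  = 1:6,
                      batch  = c("x", "y", "x", "y", "x", "y"),
                      weight = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5))

# value.var tells dcast which column to sum; the other columns are ignored
res <- dcast(wide_df, sample ~ ., fun.aggregate = sum, value.var = "value")

res
```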

In the spirit of sharing, you can also use the sqldf and data.table packages quite easily:

Your data:

df <- read.table(text="  sample value
1      a     1
2      a     2
3      b     3
4      b     4
5      b     5
6      c     6",header=TRUE)

The sqldf alternative:

library(sqldf)
sqldf("select sample, sum(value) `value` from df group by sample")
#   sample value
# 1      a     3
# 2      b    12
# 3      c     6

The data.table alternative:

library(data.table)
DT <- data.table(df, key="sample")
DT[, list(value = sum(value)), by=key(DT)]
#    sample value
# 1:      a     3
# 2:      b    12
# 3:      c     6
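
A keyed table is not required for grouping. As a sketch of a shorter variant, assuming a reasonably recent data.table where `.()` is available as an alias for `list()`, the grouping column can be named directly in `by`:

```r
library(data.table)

# Example data from the question
df <- data.frame(sample = c("a", "a", "b", "b", "b", "c"),
                 value  = 1:6)

# Convert to a data.table and sum value within each sample
DT <- as.data.table(df)
res <- DT[, .(value = sum(value)), by = sample]

res
```

Without a key, groups are returned in order of first appearance rather than sorted, which happens to coincide for this data.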
A5C1D2H2I1M1N2O1R2T1

The "classic" R command is tapply:

n <- 17
fac <- factor(rep(1:3, length.out = n), levels = 1:5)
df <- data.frame(target = 1:n, factor = fac)
# Sum target within each level; the unused levels 4 and 5 come back as NA
with(df, tapply(target, factor, sum))

by and aggregate both work, but having their input and output as lists or by objects is daft. ddply from plyr will also work; its syntax is somewhat less clear, although it comes into its own for more complicated examples.
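
tapply returns a named array rather than a data frame. If the two-column layout from the question is wanted, a minimal sketch of the conversion, using the question's sample and value columns instead of the target and factor columns above, might look like this:

```r
# Example data from the question
df <- data.frame(sample = c("a", "a", "b", "b", "b", "c"),
                 value  = 1:6)

# tapply gives a named 1-d array: one sum per level of sample
sums <- with(df, tapply(value, sample, sum))

# Turn the names and sums into ordinary data frame columns
res <- data.frame(sample = names(sums), sum = as.vector(sums))

res
```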

Stephen Henderson

One solution using a base R function is:

aggregate(x = df$value, by = list(df$sample), FUN = sum)

You can also do it with ddply from the plyr package:

library(plyr)
ddply(df, .(sample), numcolwise(sum))

Here df is your data.frame.
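
With a bare vector for x and an unnamed list for by, aggregate labels the result columns Group.1 and x. A small sketch of one way to keep the original column names, passing one-column data frames instead of vectors:

```r
# Example data from the question
df <- data.frame(sample = c("a", "a", "b", "b", "b", "c"),
                 value  = 1:6)

# Single-bracket indexing keeps the data frame structure, so the names
# "sample" and "value" carry through to the result
res <- aggregate(x = df["value"], by = df["sample"], FUN = sum)

res
```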

Jilber Urbina