49

I have a data frame where one column is species' names, and the second column is abundance values. Due to the sampling procedure, some species appear more than once (i.e., there is more than one row with Species X in it). I would like to consolidate those entries and sum their abundances.

For example, given this data frame:

set.seed(6)
df=data.frame(
  x=c("sp1","sp2","sp3","sp3","sp4","sp2","sp3"),
  y=rpois(7,2)); df

which produces:

    x y
1 sp1 2
2 sp2 4
3 sp3 1
4 sp3 1
5 sp4 3
6 sp2 5
7 sp3 5

I would like to instead produce:

    x y
1 sp1 2    
2 sp2 9     (5+4)
3 sp3 7     (5+1+1)
5 sp4 3

Thanks in advance for any help you can provide!

Paul Hiemstra
  • 59,984
  • 12
  • 142
  • 149
jslefche
  • 4,379
  • 7
  • 39
  • 50

6 Answers6

56

This works:

library(plyr)
ddply(df,"x",numcolwise(sum))

in words: (1) split the data frame df by the "x" column; (2) for each chunk, take the sum of each numeric-valued column; (3) stick the results back into a single data frame. (dd in ddply stands for "take a d ata frame as input, return a d ata frame")

Another, possibly clearer, approach:

aggregate(y~x,data=df,FUN=sum)

See quick/elegant way to construct mean/variance summary table for a related (slightly more complex) question.

Community
  • 1
  • 1
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • D'oh! How did this textbook application of ddply slip past me...Thanks Ben! – jslefche Apr 16 '12 at 20:38
  • +1 for application of numcolwise and sum. I was mucking about with transform and mutate trying to get exactly this to work. – Brandon Bertelsen Jan 26 '13 at 19:13
  • 2
    AFAIK, "aggregate(y~x+z,data=df,FUN=sum)" could be used to preserve an *additional* column (here "z"), should there be the need (was the need in my case). Please correct me if I'm wrong. – Shadow Jan 05 '16 at 12:43
31

Simple as aggregate:

aggregate(df['y'], by=df['x'], sum)
Joshua Ulrich
  • 173,410
  • 32
  • 338
  • 418
14

A dplyr solution:

library(dplyr)
df %>% group_by(x) %>% summarise(y = sum(y))
Carlos Cinelli
  • 11,354
  • 9
  • 43
  • 66
9

A data.table solution for time and memory efficiency

library(data.table)
DT <- as.data.table(df)
# which columns are numeric 
numeric_cols <- which(sapply(DT, is.numeric))
DT[, lapply(.SD, sum), by = x, .SDcols = numeric_cols]

Or, in your case, given that you know that there is only the 1 column y you wish to sum over

DT[, list(y=sum(y)),by=x]
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
mnel
  • 113,303
  • 27
  • 265
  • 254
6
> tapply(df$y, df$x, sum)
sp1 sp2 sp3 sp4 
  2   9   7   3 

if it has to be a data.frame Ben's answer works great. or you can coerce the tapply output.

out <- tapply(df$y, df$x, sum)
>     data.frame(x=names(out), y=out, row.names=NULL)
    x y
1 sp1 2
2 sp2 9
3 sp3 7
4 sp4 3
Justin
  • 42,475
  • 9
  • 93
  • 111
2

A MWE to verify whether a formula to respect a second variable (i.e., here "Z" and in addition to "X", would actually work:

example = data.frame(X=c("x"),Z=c("a"),Y=c(1), stringsAsFactors=F)
newrow = c("y","b",1)
example <- rbind(example, newrow)
newrow = c("z","a",0.5)
example <- rbind(example, newrow)
newrow = c("x","b",1)
example <- rbind(example, newrow)
newrow = c("x","b",2)
example <- rbind(example, newrow)
newrow = c("y","b",10)
example <- rbind(example, newrow)
example$X = as.factor(example$X)
example$Z = as.factor(example$Z)
example$Y = as.numeric(example$Y)
example_agg <- aggregate(Y~X+Z,data=example,FUN=sum)
Shadow
  • 1,042
  • 2
  • 15
  • 23