Consolidate duplicate rows

Question

I have a data frame where one column is species' names, and the second column is abundance values. Due to the sampling procedure, some species appear more than once (i.e., there is more than one row with Species X in it). I would like to consolidate those entries and sum their abundances.

For example, given this data frame:

set.seed(6)
df=data.frame(
  x=c("sp1","sp2","sp3","sp3","sp4","sp2","sp3"),
  y=rpois(7,2)); df

which produces:

    x y
1 sp1 2
2 sp2 4
3 sp3 1
4 sp3 1
5 sp4 3
6 sp2 5
7 sp3 5

I would like to instead produce:

    x y
1 sp1 2    
2 sp2 9     (5+4)
3 sp3 7     (5+1+1)
5 sp4 3

Thanks in advance for any help you can provide!

score 56 · Accepted Answer · edited May 23 '17 at 11:46

56

This works:

library(plyr)
ddply(df,"x",numcolwise(sum))

in words: (1) split the data frame df by the "x" column; (2) for each chunk, take the sum of each numeric-valued column; (3) stick the results back into a single data frame. (dd in ddply stands for "take a d ata frame as input, return a d ata frame")

Another, possibly clearer, approach:

aggregate(y~x,data=df,FUN=sum)

See quick/elegant way to construct mean/variance summary table for a related (slightly more complex) question.

edited May 23 '17 at 11:46

Community

1
1

answered Apr 16 '12 at 19:12

Ben Bolker

211,554
25
370
453

D'oh! How did this textbook application of ddply slip past me...Thanks Ben! – jslefche Apr 16 '12 at 20:38
+1 for application of numcolwise and sum. I was mucking about with transform and mutate trying to get exactly this to work. – Brandon Bertelsen Jan 26 '13 at 19:13
2

AFAIK, "aggregate(y~x+z,data=df,FUN=sum)" could be used to preserve an *additional* column (here "z"), should there be the need (was the need in my case). Please correct me if I'm wrong. – Shadow Jan 05 '16 at 12:43

score 31 · Answer 2 · answered Apr 16 '12 at 19:15

31

Simple as aggregate:

aggregate(df['y'], by=df['x'], sum)

answered Apr 16 '12 at 19:15

Joshua Ulrich

173,410
32
338
418

score 14 · Answer 3 · answered Jan 05 '16 at 13:34

14

A dplyr solution:

library(dplyr)
df %>% group_by(x) %>% summarise(y = sum(y))

answered Jan 05 '16 at 13:34

Carlos Cinelli

11,354
9
43
66

score 9 · Answer 4 · edited Sep 13 '12 at 10:46

9

A data.table solution for time and memory efficiency

library(data.table)
DT <- as.data.table(df)
# which columns are numeric 
numeric_cols <- which(sapply(DT, is.numeric))
DT[, lapply(.SD, sum), by = x, .SDcols = numeric_cols]

Or, in your case, given that you know that there is only the 1 column y you wish to sum over

DT[, list(y=sum(y)),by=x]

edited Sep 13 '12 at 10:46

Matt Dowle

58,872
22
166
224

answered Sep 13 '12 at 04:13

mnel

113,303
27
265
254

score 6 · Answer 5 · answered Apr 16 '12 at 19:16

> tapply(df$y, df$x, sum)
sp1 sp2 sp3 sp4 
  2   9   7   3

if it has to be a data.frame Ben's answer works great. or you can coerce the tapply output.

out <- tapply(df$y, df$x, sum)
>     data.frame(x=names(out), y=out, row.names=NULL)
    x y
1 sp1 2
2 sp2 9
3 sp3 7
4 sp4 3

score 2 · Answer 6 · answered Jan 05 '16 at 13:31

A MWE to verify whether a formula to respect a second variable (i.e., here "Z" and in addition to "X", would actually work:

example = data.frame(X=c("x"),Z=c("a"),Y=c(1), stringsAsFactors=F)
newrow = c("y","b",1)
example <- rbind(example, newrow)
newrow = c("z","a",0.5)
example <- rbind(example, newrow)
newrow = c("x","b",1)
example <- rbind(example, newrow)
newrow = c("x","b",2)
example <- rbind(example, newrow)
newrow = c("y","b",10)
example <- rbind(example, newrow)
example$X = as.factor(example$X)
example$Z = as.factor(example$Z)
example$Y = as.numeric(example$Y)
example_agg <- aggregate(Y~X+Z,data=example,FUN=sum)

Consolidate duplicate rows

6 Answers6

Linked

Related