Following on from my questions:
1. Identifying whether a set of variables uniquely identifies each row of the data or not;
2. Tagging all rows that are duplicates in terms of a given set of variables,
I would now like to aggregate/consolidate all rows that are duplicates in terms of a given set of variables, by taking the sum of the remaining (data) columns.
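For a toy example, the behaviour I am after looks like this (using base R's aggregate purely to illustrate what I mean; the real data is much larger):
# rows 1 and 3 are duplicates in terms of (f1, f2), so their data values
# should be consolidated into a single row by summing
toyDf = data.frame(f1 = c("a", "b", "a"), f2 = c("x", "y", "x"), data = c(1, 2, 4))
aggregate(data ~ f1 + f2, data = toyDf, FUN = sum)
#   f1 f2 data
# 1  a  x    5
# 2  b  y    2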
Solution 1:
There is some guidance on how to do this here, but when the variables that form the index have a large number of levels, the ddply method recommended there is slow, just as it was when I was trying to tag all the duplicates by a given set of variables.
library(plyr)

# Values of (f1, f2, f3, f4) uniquely identify observations
dfUnique = expand.grid(f1 = factor(1:16),
                       f2 = factor(1:41),
                       f3 = factor(1:2),
                       f4 = factor(1:104))
# sample some extra rows and rbind them to create duplicates
dfDup = rbind(dfUnique, dfUnique[sample(1:nrow(dfUnique), 100), ])
# dummy data column
dfDup$data = rnorm(nrow(dfDup))
# aggregate the duplicate rows by taking the sum of data
dfDupAgg = ddply(dfDup, .(f1, f2, f3, f4), summarise, data = sum(data))
Solution 2:
The second solution is to use data.table, and following the advice here, I could do:
library(data.table)

# data.table solution
indexVars = paste0('f', 1:4)  # "f1" "f2" "f3" "f4"
dtDup = data.table(dfDup, key = indexVars)
dtDupAgg = dtDup[, list(data = sum(data)), by = key(dtDup)]
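As a rough sanity check (this assumes that keying the ddply result on the same variables puts its rows in the same order as the data.table result), I can compare the two aggregates:
# key the ddply result the same way and compare the aggregated sums;
# should return TRUE if the two solutions agree
dtDupAggCheck = data.table(dfDupAgg, key = indexVars)
all.equal(dtDupAggCheck$data, dtDupAgg$data)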
I have a couple of questions:
1. Is there a way to make the ddply version faster? (The only idea I have had so far is to run it in parallel; see the sketch after these questions.)
2. Is the data.table version correct? I want to check since I am new to data.table.
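For reference, the only speed-up I have come up with myself for question 1 is ddply's built-in .parallel option. This is just a sketch, assuming the doParallel backend is available, and I do not know whether the parallel overhead pays off with this many small groups:
library(plyr)
library(doParallel)

# register a parallel backend and let ddply split the groups across cores
registerDoParallel(cores = 2)
dfDupAggPar = ddply(dfDup, .(f1, f2, f3, f4), summarise,
                    data = sum(data), .parallel = TRUE)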