I have a data frame with a grouping (by) variable and several variables to aggregate, but with different functions for different variables.
d <- data.frame(year = c(rep(2011, 5), rep(2012, 5)),
                v1 = sample(1:10, 10),
                v2 = sample(1:10, 10),
                v3 = sample(1:10, 10),
                v4 = sample(1:10, 10)
)
d
#    year v1 v2 v3 v4
# 1  2011  1  7  1  3
# 2  2011  6  3  2 10
# 3  2011  7  9  5  8
# 4  2011 10  8  6  9
# 5  2011  3  2  8  4
# 6  2012  9  5  7  6
# 7  2012  2  6  9  5
# 8  2012  4  1  4  7
# 9  2012  5  4  3  1
# 10 2012  8 10 10  2
Now, v1 and v2 need to be aggregated by sum, and v3 and v4 by mean. If the variable names are known in advance and can be written as literals, ddply with summarize works well:
library(plyr)
ddply(d, "year", summarize, a1=sum(v1), a2=sum(v2), a3=mean(v3), a4=mean(v4))
#   year a1 a2  a3  a4
# 1 2011 27 29 4.4 6.8
# 2 2012 28 26 6.6 4.2
However, in my case the two sets of column names are available only as character vectors, i.e.:
cols1 <- c("v1", "v2")
cols2 <- c("v3", "v4")
# cols1 and cols2 are dynamically generated at runtime.
# v1,v2,v3,v4 are not directly available.
I have tried to achieve the aggregations by these two methods, but neither works:
# ddply without summarize
ddply(d, "year", function(x) cbind(colSums(x[cols1]), colMeans(x[cols2])))
# weird output!
# ddply with summarize
ddply(d, "year", summarize, colSums(cols1), colMeans(cols2))
#Error in colSums(cols1) : 'x' must be an array of at least two dimensions
If the best way to do this does not use ddply (say aggregate, maybe), that's perfectly okay.
The best workaround I have right now is doing the two aggregations separately, and then merging the two data frames using the aggregation by-variable.
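For reference, the workaround looks roughly like the following. This is a minimal sketch using base R's aggregate and merge; the sample values are hard-coded from the output shown earlier so the result is reproducible, and the names sums, means, and result are only illustrative.

```r
# Sample data, hard-coded from the printed data frame above
d <- data.frame(
  year = c(rep(2011, 5), rep(2012, 5)),
  v1 = c(1, 6, 7, 10, 3, 9, 2, 4, 5, 8),
  v2 = c(7, 3, 9, 8, 2, 5, 6, 1, 4, 10),
  v3 = c(1, 2, 5, 6, 8, 7, 9, 4, 3, 10),
  v4 = c(3, 10, 8, 9, 4, 6, 5, 7, 1, 2)
)

# Column names known only as character vectors
cols1 <- c("v1", "v2")  # to be summed
cols2 <- c("v3", "v4")  # to be averaged

# Aggregate each set of columns separately, grouped by year
sums  <- aggregate(d[cols1], by = d["year"], FUN = sum)
means <- aggregate(d[cols2], by = d["year"], FUN = mean)

# Join the two results on the by-variable
result <- merge(sums, means, by = "year")
result
```

It gives the right numbers, but it needs one aggregate call per function plus a merge, which is what I would like to avoid.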