Using ddply with multiple functions on lists of columns

Question

I have a data frame with a grouping (by) variable and several variables to aggregate, each set with a different function.

d <- data.frame(year=c(rep(2011,5), rep(2012,5)),
            v1 = sample(1:10, 10),
            v2 = sample(1:10, 10),
            v3 = sample(1:10, 10),
            v4 = sample(1:10, 10)
            )
d

#     year v1 v2 v3 v4
# 1  2011  1  7  1  3
# 2  2011  6  3  2 10
# 3  2011  7  9  5  8
# 4  2011 10  8  6  9
# 5  2011  3  2  8  4
# 6  2012  9  5  7  6
# 7  2012  2  6  9  5
# 8  2012  4  1  4  7
# 9  2012  5  4  3  1
# 10 2012  8 10 10  2

Now, v1 and v2 need to be aggregated by sum, and v3 and v4 by mean. If the variable names are available explicitly as literals, ddply with summarize works well:

library(plyr)

ddply(d, "year", summarize, a1=sum(v1), a2=sum(v2), a3=mean(v3), a4=mean(v4))
#   year a1 a2  a3  a4
# 1 2011 27 29 4.4 6.8
# 2 2012 28 26 6.6 4.2

However, in my case the two sets of columns are available only as character vectors, i.e.:

cols1 <- c("v1", "v2")
cols2 <- c("v3", "v4")
# cols1 and cols2 are dynamically generated at runtime.
# v1,v2,v3,v4 are not directly available.

I have tried to achieve the aggregations by these two methods, but neither works:

# ddply without summarize
ddply(d, "year", function(x) cbind(colSums(x[cols1]), colMeans(x[cols2])))
# weird output!

# ddply with summarize
ddply(d, "year", summarize, colSums(cols1), colMeans(cols2))
#Error in colSums(cols1) : 'x' must be an array of at least two dimensions

If the best way to do this does not use ddply (say aggregate, maybe), that's perfectly okay.

The best workaround I have right now is doing the two aggregations separately, and then merging the two data frames using the aggregation by-variable.
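
For reference, the workaround described above might look like the following sketch. It uses base R's `aggregate` and `merge`. The data frame is rebuilt from the values printed earlier so the result can be compared with the `ddply` output, and the names `sums`, `means` and `res` are illustrative only.

```r
# Fixed copy of the sample data printed above (instead of sample())
d <- data.frame(year = rep(c(2011, 2012), each = 5),
                v1 = c(1, 6, 7, 10, 3, 9, 2, 4, 5, 8),
                v2 = c(7, 3, 9, 8, 2, 5, 6, 1, 4, 10),
                v3 = c(1, 2, 5, 6, 8, 7, 9, 4, 3, 10),
                v4 = c(3, 10, 8, 9, 4, 6, 5, 7, 1, 2))
cols1 <- c("v1", "v2")  # columns to sum
cols2 <- c("v3", "v4")  # columns to average

# Aggregate each set of columns separately, grouped by year
sums  <- aggregate(d[cols1], by = d["year"], FUN = sum)
means <- aggregate(d[cols2], by = d["year"], FUN = mean)

# Join the two results on the by-variable
res <- merge(sums, means, by = "year")
res
```

This works, but it needs one pass over the data per function plus a merge.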

Is [**this**](http://stackoverflow.com/questions/6955128/object-not-found-error-with-ddply-inside-a-function) relevant? — Henrik, Apr 10 '14 at 07:26
@Roland Thanks, that does work. I'm not sure I follow the logic to it, though. — ninjasnowman, Apr 10 '14 at 10:38
@ninjasnowman Have a look at `help("ddply")` (section "Output"). — Roland, Apr 10 '14 at 10:58
@Roland "The most unambiguous behaviour is achieved when .fun returns a data frame - in that case pieces will be combined with rbind.fill. If .fun returns an atomic vector of fixed length, it will be rbinded together and converted to a data frame. Any other values will result in an error." Per my understanding, cbind() would return a data frame, and c() would return a list of vectors. — ninjasnowman, Apr 10 '14 at 12:18
@ninjasnowman `cbind` returns a matrix here. You could use `cbind.data.frame` to return a 2x2 data.frame. However, you'd want a 1x4 data.frame, which is what is created automatically if your function returns a vector. — Roland, Apr 10 '14 at 14:00
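
Roland's original suggestion is not shown in the thread. Judging from the comments above, it presumably has the anonymous function return a single vector built with `c()` rather than a matrix built with `cbind()`. A minimal sketch under that assumption, using a fixed copy of the sample data (the name `res` is illustrative):

```r
library(plyr)

# Fixed copy of the sample data printed in the question
d <- data.frame(year = rep(c(2011, 2012), each = 5),
                v1 = c(1, 6, 7, 10, 3, 9, 2, 4, 5, 8),
                v2 = c(7, 3, 9, 8, 2, 5, 6, 1, 4, 10),
                v3 = c(1, 2, 5, 6, 8, 7, 9, 4, 3, 10),
                v4 = c(3, 10, 8, 9, 4, 6, 5, 7, 1, 2))
cols1 <- c("v1", "v2")  # columns to sum
cols2 <- c("v3", "v4")  # columns to average

# c() joins the two named numeric vectors into one named vector of
# length 4, so ddply can bind one row per year instead of a 2x2 matrix
res <- ddply(d, "year", function(x) c(colSums(x[cols1]), colMeans(x[cols2])))
res
```

Because the function returns an atomic vector of fixed length for every group, ddply binds the pieces row-wise and converts them to a data frame, as the help text quoted above describes. The totals should match the explicit `summarize` call in the question.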
