Nesting aggregate within apply to aggregate multiple columns by multiple variables in R

Question

I have a dataframe with sets of scores, and sets of grouping variables, something like:

s1 s2 s3 g1 g2 g3
4  3  7  F   F  T
6  2  2  T   T  T
2  4  9  G   G  F
1  3  1  T   F  G

I want to run an aggregate, at the moment I'm doing:

aggregate(df[c("s1","s2","s3")],df["g1"],function(x) c(m =mean(x, na.rm=T), sd = sd(x, na.rm=T), n = length(x)))

I'd like to do this in a single line of code, aggregating the multiple variables by each of the multiple factors in one go. Note that I'm not trying to get a summary of s1-3 by combinations of g1-3 (as per answers here). I've looked at summaryBy in the doBy package, but that also seems to work on combinations of the factors rather than on each factor separately, which isn't what I want (useful though!). I've been playing with variants on:

apply(df[c("g1","g2","g3")], 2, function (z) aggregate(df[c("s1","s2","s3")],z,function(x) c(m =mean(x, na.rm=T), sd = sd(x, na.rm=T), n = length(x))))

But I get the error "'by' must be a list" with that. I think I could work out how to do this with a loop, and I know that various versions of ddply or reshape can do the aggregation, but the most intuitive way (to me at least) seems to be an apply and aggregate. What am I missing?

I don't really mind, I was assuming a list. I prefer working with dataframes but it seems like reshaping the sets of output into a df is really a separate issue — sjgknight, Feb 23 '15 at 17:16
@sjgknight You could try `lapply(paste0('g',1:3), function(y) aggregate(cbind(s1,s2,s3)~., df[c(y,paste0('s',1:3))], function(x) c(mean=mean(x, na.rm=T), sd=sd(x, na.rm=T), n=length(x))))` — akrun, Feb 23 '15 at 17:29
I've tried to produce a simplified example for reproducibility purposes but actually the colnames aren't as uniform as that, tried to adapt: `lapply(c("IQ","PL"), function(y) aggregate(explor+sourDiv+sourQual+otherEval+Topic~., df[c(y,explor,sourDiv,sourQual,otherEval,Topic)], function(x) c(mean(x, na.rm=T), sd(x, na.rm=T), length(x))))` I'm getting an "object 'explor' not found" error (but it is a named column in the df). — sjgknight, Feb 23 '15 at 17:32
@sjgknight Check whether there are any leading/trailing spaces in the column names. Also, instead of `+`, you should use `cbind` — akrun, Feb 23 '15 at 17:34
Thanks @akrun, I like this, not least because it's intuitive enough for me to understand! (I think: work through this list of factors (`lapply`), over this set of variables (first part of `aggregate`), with `~.` meaning by everything not used elsewhere, and use the lapply to build the right dataframe (`df[c(y,...)]`).) There is no leading/trailing whitespace, but it's still not working. Is the way I've set up the lapply likely to work there? (Note there's a missing `)` after Topic, before the `~`, which I've corrected in my code.) — sjgknight, Feb 23 '15 at 17:50
@sjgknight Ok, now I understand the problem. It should be `df[c(y, 'explor', 'sourDiv', 'sourQual',...)]` — akrun, Feb 23 '15 at 17:55
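
For reference, here is a minimal sketch of the formula-based suggestion from the comments above, run on the sample data from the question. The names `scores`, `groups`, and `res` are illustrative assumptions and do not appear in the thread. With non-uniform column names, the quoted strings in `scores` and the bare names inside `cbind()` would need to be replaced accordingly.

```r
# Sample data from the question, rebuilt by hand.
df <- data.frame(
  s1 = c(4, 6, 2, 1),
  s2 = c(3, 2, 4, 3),
  s3 = c(7, 2, 9, 1),
  g1 = c("F", "T", "G", "T"),
  g2 = c("F", "T", "G", "F"),
  g3 = c("T", "T", "F", "G"),
  stringsAsFactors = FALSE
)

# Hypothetical helper vectors: column names must be quoted strings here.
scores <- c("s1", "s2", "s3")
groups <- c("g1", "g2", "g3")

mean.sd.n <- function(x) {
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE), n = length(x))
}

res <- lapply(groups, function(y) {
  # Keep one grouping column plus the scores, so that `.` on the
  # right-hand side of the formula expands to that grouping column only.
  aggregate(cbind(s1, s2, s3) ~ ., data = df[c(y, scores)], FUN = mean.sd.n)
})
names(res) <- groups  # label each result by its grouping variable
```

One thing to keep in mind with this variant: the formula interface of `aggregate` drops rows containing `NA` before calling the function (`na.action = na.omit` by default), so `n` counts complete rows only.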

G. Grothendieck · Accepted Answer · 2015-02-23T18:02:15.797

Let us name the anonymous function in the question as follows. Then the Map statement at the end applies aggregate to df[1:3] separately by each grouping variable:

mean.sd.n <- function(x) c(m = mean(x, na.rm=TRUE), sd = sd(x, na.rm=TRUE), n = length(x))

Map(function(nm) aggregate(df[1:3], df[nm], mean.sd.n), names(df)[4:6])

giving:

$g1
  g1     s1.m    s1.sd     s1.n      s2.m     s2.sd      s2.n      s3.m     s3.sd      s3.n
1  F 4.000000       NA 1.000000 3.0000000        NA 1.0000000 7.0000000        NA 1.0000000
2  G 2.000000       NA 1.000000 4.0000000        NA 1.0000000 9.0000000        NA 1.0000000
3  T 3.500000 3.535534 2.000000 2.5000000 0.7071068 2.0000000 1.5000000 0.7071068 2.0000000

$g2
  g2    s1.m   s1.sd    s1.n s2.m s2.sd s2.n     s3.m    s3.sd     s3.n
1  F 2.50000 2.12132 2.00000    3     0    2 4.000000 4.242641 2.000000
2  G 2.00000      NA 1.00000    4    NA    1 9.000000       NA 1.000000
3  T 6.00000      NA 1.00000    2    NA    1 2.000000       NA 1.000000

$g3
  g3     s1.m    s1.sd     s1.n      s2.m     s2.sd      s2.n     s3.m    s3.sd     s3.n
1  F 2.000000       NA 1.000000 4.0000000        NA 1.0000000 9.000000       NA 1.000000
2  G 1.000000       NA 1.000000 3.0000000        NA 1.0000000 1.000000       NA 1.000000
3  T 5.000000 1.414214 2.000000 2.5000000 0.7071068 2.0000000 4.500000 3.535534 2.000000
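
The printed output above looks like ten ordinary columns, but `aggregate` stores each set of summaries as a matrix column, so each result really has four columns. The sketch below rebuilds the sample data by hand, reruns the `Map` call, and then flattens the matrix columns with `do.call(data.frame, ...)`. The names `res` and `flat` are assumptions introduced for illustration.

```r
# Sample data from the question, rebuilt by hand.
df <- data.frame(
  s1 = c(4, 6, 2, 1),
  s2 = c(3, 2, 4, 3),
  s3 = c(7, 2, 9, 1),
  g1 = c("F", "T", "G", "T"),
  g2 = c("F", "T", "G", "F"),
  g3 = c("T", "T", "F", "G"),
  stringsAsFactors = FALSE
)

mean.sd.n <- function(x) {
  c(m = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE), n = length(x))
}

# One aggregate per grouping variable; Map names the list g1, g2, g3.
res <- Map(function(nm) aggregate(df[1:3], df[nm], mean.sd.n), names(df)[4:6])

# Each of s1, s2, s3 is a 3-column matrix inside the data frame.
ncol(res$g1)          # 4: g1 plus three matrix columns
is.matrix(res$g1$s1)  # TRUE

# Flatten every result so that s1.m, s1.sd, s1.n, ... become real columns.
flat <- lapply(res, function(d) do.call(data.frame, d))
names(flat$g1)
```

After flattening, a summary can be accessed directly, for example `flat$g1$s1.m`, whereas the unflattened form needs matrix indexing such as `res$g1$s1[, "m"]`.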

Note: This could be shortened slightly by using fn$ from the gsubfn package. It allows us to specify the anonymous function in the line of code that starts with Map using formula notation as shown:

library(gsubfn)
fn$Map(nm ~ aggregate(df[1:3], df[nm], mean.sd.n), names(df)[4:6])
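
As for the error in the question: `apply` first converts the grouping columns to a character matrix and then hands each column to the function as a plain atomic vector, while `aggregate` requires its `by` argument to be a list. The sketch below shows this on the sample data and gives a base R variant that wraps each column in `list()`. The names `res` and `group` are assumptions introduced for illustration.

```r
# Sample data from the question, rebuilt by hand.
df <- data.frame(
  s1 = c(4, 6, 2, 1),
  s2 = c(3, 2, 4, 3),
  s3 = c(7, 2, 9, 1),
  g1 = c("F", "T", "G", "T"),
  g2 = c("F", "T", "G", "F"),
  g3 = c("T", "T", "F", "G"),
  stringsAsFactors = FALSE
)

mean.sd.n <- function(x) {
  c(m = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE), n = length(x))
}

# What apply() passes to the function is never a list,
# which is why aggregate() complains about `by`.
apply(df[c("g1", "g2", "g3")], 2, is.list)

# lapply() over the data frame visits one column at a time and keeps the
# column names; list(group = z) gives aggregate() the list it expects.
res <- lapply(df[c("g1", "g2", "g3")], function(z) {
  aggregate(df[c("s1", "s2", "s3")], by = list(group = z), FUN = mean.sd.n)
})
```

In this variant the grouping column of every result is called `group` rather than `g1`, `g2`, or `g3`. Indexing with `df[nm]`, as in the `Map` call, keeps the original column name because a one-column data frame is already a named list.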
