Taking row means based on a partition of the columns

Question

I have a matrix mat and would like to calculate, for each row, the mean over groups of columns defined by a grouping variable gp.

mat<-embed(1:5000,1461)
gp<-c(rep(1:365,each=4),366)
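
A quick look at the shapes involved may help here. This is only a sketch that inspects the two objects defined above; the sizes in the comments follow from embed() returning length(x) - dimension + 1 rows.

```r
mat <- embed(1:5000, 1461)
gp  <- c(rep(1:365, each = 4), 366)

dim(mat)            # 3540 1461: 3540 rows, 1461 columns
length(gp)          # 1461: one group label per column
length(unique(gp))  # 366: 365 groups of four columns plus one single-column group
```

So the desired result has one mean per row and per group, i.e. 3540 x 366 values.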

To do this, I use the following:

colavg<-t(aggregate(t(mat),list(gp),mean))

However, this takes much longer than I expected.

Any suggestions for making the code run faster?

First steps for speeding up R code: http://stackoverflow.com/a/8474941/636656 — Ari B. Friedman, Mar 31 '12 at 12:17
@gsk3 Thanks for the pointer. I am not familiar with data.table, but will do some readings on it. — Tony, Mar 31 '12 at 12:44

score 2 · Accepted Answer · answered Mar 31 '12 at 13:11

Here is a fast algorithm; the comments in the code explain each step.

system.time({

# create a list of column indices per group
gp.list    <- split(seq_len(ncol(mat)), gp)

# for each group, compute the row means over that group's columns
means.list <- lapply(gp.list, function(cols) rowMeans(mat[, cols, drop = FALSE]))

# bind the per-group means together, one column per group
colavg     <- do.call(cbind, means.list)

})
#    user  system elapsed 
#    0.08    0.00    0.08
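
As a sanity check, the split/rowMeans approach can be compared with the aggregate() call from the question on a tiny matrix. This is a minimal sketch: the names mat_small, gp_small, cols_by_group, fast and slow are made up for illustration, and the comparison assumes the only differences between the two results are the extra Group.1 row and the dimnames.

```r
# Small example: 3 rows, 5 columns, columns grouped as (1, 1, 2, 2, 3)
mat_small <- matrix(1:15, nrow = 3)
gp_small  <- c(1, 1, 2, 2, 3)

# split/rowMeans approach, same steps as above
cols_by_group <- split(seq_len(ncol(mat_small)), gp_small)
fast <- do.call(cbind, lapply(cols_by_group, function(cols) {
  rowMeans(mat_small[, cols, drop = FALSE])
}))

# aggregate() approach from the question; after transposing, the first row
# holds the group labels (Group.1), so it is dropped before comparing
slow <- t(aggregate(t(mat_small), list(gp_small), mean))[-1, , drop = FALSE]

# compare values only, since the dimnames differ between the two results
same <- isTRUE(all.equal(unname(fast), unname(slow)))
print(same)
```

Note that the aggregate() result carries the group labels as an extra first row, which the split/rowMeans result does not have; the labels end up in the column names instead.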

score 1 · Answer 2 · answered Mar 31 '12 at 12:56

You could use an apply-style function, for example daply from the excellent plyr package:

# Create data
mat<-embed(1:5000,1461)
gp<-c(rep(1:365,each=4),366)

# Your code
system.time(colavg<-t(aggregate(t(mat),list(gp),mean)))

library(plyr)
# Put all data in a data frame
df <- data.frame(t(mat))
df$gp <- gp

# Using an apply function
system.time(colavg2 <- t(daply(df, .(gp), colMeans)))

Output:

> # Your code
> system.time(colavg<-t(aggregate(t(mat),list(gp),mean)))
   user  system elapsed 
 134.21    1.64  139.00 

> # Using an apply function
> system.time(colavg2 <- t(daply(df, .(gp), colMeans)))
   user  system elapsed 
  52.78    0.06   53.23
