1

I need to aggregate over a number of dependent measures (DMs) in R. I found the following discussion here quite useful:

Aggregate / summarize multiple variables per group (i.e. sum, mean, etc)

Based on this, the code below basically does what I need. It gets quite lengthy, however, as the number of DMs increases (I have many DMs):

aggregate(cbind(DM1, DM2, DV3, DM4, DM5 ... DMn) ~ F1 + F2 +
           F3, data = sst2, mean, na.rm=TRUE) 

I was therefore wondering if there was a more efficient way of writing the DMs, without having to individually type every one of them. Most DMs of interest are next to one another (i.e. DM3, DM4, DM5 etc.), so I was thinking of using something along the lines of cbind(DM1, DM3:DM10, DM14), but this doesn't seem to work. I also tried generating a list of the relevant column names. Unfortunately this didn't work either:

pr<-colnames(sst2)
pr2<-pr[pr!="DM2" & pr!="DM11" & pr!="DM12" & pr!="DM13"]
pr3<-noquote(paste(pr2,collapse=","))
pp<-aggregate(cbind(pr3) ~ F1 + F2 +
           F3, data = sst2, mean, na.rm=TRUE) 

Any suggestions on how to efficiently include a large number of DMs in the aggregate function (or other related functions such as ddply) would be much appreciated.

Community
  • 1
  • 1
Tiberius
  • 331
  • 1
  • 9

2 Answers2

1

I believe that this should work

sst2 <- data.frame(F1=c("A","A","B","B","C","C"),
                   F2=c("A","A","A","B","B","B"),
                   F3=c("D","D","D","D","D","D"),
                   DM1=c(5,6,21,61,2,3),
                   DM2=c(1,5,3,6,1,6),
                   DM3=c(1,7,9,1,4,44))

n = 3 # number of DM columns
m = 2 # number of F columns

DM <- paste0("DM", 1:n)

attach(sst2)

# use sapply(DM,get) but this produces separate columns
tmp <- aggregate(sapply(DM, get) ~ F1 + F2, 
                 data = sst2, mean, na.rm=TRUE)

detach(sst2)

# combine these separate columns. The apply is to each row of tmp
data.frame(F1 = tmp$F1, F2 = tmp$F2,
    DM = apply(tmp[(m+1):(n+length(DM)-1)], 1, mean))

#   F1 F2        DM
# 1  A  A  4.166667
# 2  B  A 11.000000
# 3  B  B 22.666667
# 4  C  B 10.000000

Edit

If your variable names are different than the only line that would need to change is

DM <- c("mean.go.RT", "mean.SRT", "mean.SSD", "SSRT")

If these variables are in your data frame, you could easily get them with

DM <- names(sst2)[4:6]

or whatever other columns (i.e. instead of 4-6) that you want

nathanesau
  • 1,681
  • 16
  • 27
  • Thank you very much for your quick reply. I have just tried this out and get an error message at the call of the aggregate function: > tmp <- aggregate(sapply(DM, get) ~ F1 + F2, + data = sst2, mean, na.rm=TRUE) Error in FUN(X[[i]], ...) : object 'DM1' not found I did load the previous lines in your script as well (obviously). So, frankly, I am a bit puzzled why it cannot be found, as it should be loaded with sst2 and DM in your script. Any ideas? – Tiberius Aug 17 '15 at 20:47
  • Also, in my data the variable names are more diverse (e.g. mean.go.RT, mean.SRT, mean.SSD, SSRT instead of DM1-DM4). For my question here I had tried to simplify this a bit for the sake of clarity, but it seems that this simplification actually changes the solution somewhat - apologies for this! Nonetheless, trying to adapt your idea to my script, I receive the same error message: pr3<-paste(pr2) tmp <- aggregate(sapply(pr3, get) ~ F1 + F2, data = sst2, mean, na.rm=TRUE) Error in FUN(X[[i]], ...) : object etc. Do you know why this might be? – Tiberius Aug 17 '15 at 20:57
  • Yes I know - you need to ``attach`` the object, or get can't find it. – nathanesau Aug 17 '15 at 21:12
0

An alternative solution using select, ddply and numcolwise:

library(dplyr)
library(plyr)

sst21 <- data.frame(F1=c("A","A","B","B","C","C"),
                   F2=c("A","A","A","B","B","B"),
                   F3=c("D","D","D","D","D","D"),
                   DM1=c(5,6,21,61,2,3),
                   DM2=c(1,5,3,6,1,6),
                   DM3=c(1,7,9,1,4,44),
                   DM4=c(2,3,6,7,2,33),
                   DM5=c(44,55,66,77,55,88))

sel1 <- dplyr::select(sst21, starts_with("F"), .data$DM1 : .data$DM3, .data$DM5) # select columns of interest
sel1 <- dplyr::select(sst21, -c(.data$DM4)) # Alternative: specifying columns to be excluded

sst22 <- plyr::ddply(sel1, .(F1, F2, F3), plyr::numcolwise(mean, na.rm = TRUE)) # Aggregate selected data
Tiberius
  • 331
  • 1
  • 9