R -- How can I calculate group means for a list of data frames, using a different subset condition to calculate each mean?

Question

I have a list of three data frames and would like to generate another list of three data frames whose rows consist of each of the values of a grouping variable (g1) and the means of six variables by the g1 variable. The twist is that I would like to calculate the means for the three continuous variables only when the value of the corresponding dummy variable equals 1.

Reproducible example:

a <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),c(1,1,1,1,0,0,0,1,0,0),c(0,0,1,0,1,0,0,1,0,1),c(0,0,0,1,0,0,1,1,0,0),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
b <- data.frame(c("fj","a","fj","a","fj","fj","fj","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
c <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
u <- list(a,b,c)
u <- lapply(u, setNames, nm = c('g1','dummy1','dummy2','dummy3','contin1','contin2','contin3'))

u[[1]]

> u
[[1]]
   g1 dummy1 dummy2 dummy3  contin1 contin2 contin3
1  fj      1      0      0       199      18      61
2  fj      1      0      0        91     158      28
3  fj      1      1      0       147      67     190
4   a      1      0      1       181     105      22
5  fj      0      1      0        14      16     156
6   a      0      0      0       178      14      98
7   g      0      0      1       116      97      30
8   g      1      1      1        48      31     144
9   g      0      0      0        60      21     112
10  g      0      1      0        95     145     199

I would like to calculate the mean of contin1 only when dummy1 = 1, the mean of contin2 only when dummy2 = 1, and the mean of contin3 only when dummy3 = 1.

The output I WANT for the first list:

> rates
[[1]]
  x[, 1]   V1  V2  V3 x[, 1] x[, 6] x[, 1] x[, 7] x[, 1] x[, 8]
1      a 0.50 0.0 0.5      a 181         a  NA         a  22
2     fj 0.75 0.5 0.0     fj 145.67     fj  41.5      fj  NA
3      g 0.25 0.5 0.5      g  48         g  88         g  87

What I have tried:

rates <- lapply(u, function(x) {
    cbind(aggregate(cbind(x[,2],x[,3],x[,4]) ~ x[,1], FUN = mean, na.action = NULL),
    aggregate(x[,6] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,2] == 1)),
    aggregate(x[,7] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,3] == 1)),
    aggregate(x[,8] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,4] == 1)))
    })
Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 3, 2

I understand that this error comes from cbind, which fails whenever you try to combine objects with different numbers of rows. (The column x[, 6] has three rows whereas x[, 7] and x[, 8] have two.) I was hoping there was some way for aggregate to keep one row per value of the grouping variable, so that every piece would have the same number of rows and the cbind would work. Perhaps this is not possible, given that the R documentation says: "Rows with missing values in any of the by variables will be omitted from the result."
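To make the row mismatch concrete, here is a minimal base R sketch using values copied from the printed u[[1]] above. The names x, full, sub, and merged are illustrative and not part of the original attempt, and the merge step is only one possible workaround under these assumptions, not something the aggregate documentation prescribes.

```r
# Values copied from the printed u[[1]] (only the columns needed here)
x <- data.frame(
  g1      = c("fj","fj","fj","a","fj","a","g","g","g","g"),
  dummy2  = c(0,0,1,0,1,0,0,1,0,1),
  contin2 = c(18,158,67,105,16,14,97,31,21,145)
)

# All rows: every group is present, so the result has 3 rows
full <- aggregate(contin2 ~ g1, data = x, FUN = mean)

# Subset: group "a" has no row with dummy2 == 1, so it disappears (2 rows)
sub <- aggregate(contin2 ~ g1, data = x, FUN = mean, subset = dummy2 == 1)

# Illustrative workaround: left-join onto the full set of groups,
# so the missing group comes back as NA instead of being dropped
merged <- merge(full["g1"], sub, by = "g1", all.x = TRUE)
print(merged)
```

Because merged always has one row per group, pieces built this way could be combined by g1 without the "differing number of rows" error.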

I have carefully read the documentation for aggregate. The following two posts address similar issues, but neither uses different subsets of the data to calculate the means.

R: Calculate means for subset of a group and Means from a list of data frames in R

Any suggestions would be hugely appreciated.

I'd suggest replacing the 0s in the dummies with NA, then multiplying the continX values by the corresponding dummyX values, then using na.rm=T in the aggregate function. (And would also suggest dplyr unless you're specifically after a base r solution.) — JonMinton, Mar 22 '15 at 22:33
@JonMinton: your idea works so long as none of the dummy columns contain all zeros for one of the groups, which is the problem I'm running up against in my code. — IamWill, Mar 22 '15 at 22:57
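One possible reading of the masking idea in these comments, sketched in base R with the values printed for u[[1]]. It assumes na.action = na.pass so that rows containing NA are not discarded before the means are taken; whether that matches what either commenter actually ran is not known. The names x, masked, and res are illustrative.

```r
# Values copied from the printed u[[1]]
x <- data.frame(
  g1      = c("fj","fj","fj","a","fj","a","g","g","g","g"),
  dummy1  = c(1,1,1,1,0,0,0,1,0,0),
  dummy2  = c(0,0,1,0,1,0,0,1,0,1),
  dummy3  = c(0,0,0,1,0,0,1,1,0,0),
  contin1 = c(199,91,147,181,14,178,116,48,60,95),
  contin2 = c(18,158,67,105,16,14,97,31,21,145),
  contin3 = c(61,28,190,22,156,98,30,144,112,199)
)

# Blank out each continuous value whose matching dummy is not 1
masked <- x
for (i in 1:3) {
  keep <- x[[paste0("dummy", i)]] == 1
  masked[[paste0("contin", i)]][!keep] <- NA
}

# na.pass keeps the rows with NA; na.rm = TRUE is passed on to mean()
res <- aggregate(cbind(contin1, contin2, contin3) ~ g1, data = masked,
                 FUN = mean, na.rm = TRUE, na.action = na.pass)
print(res)
```

In this sketch a group whose dummy is all zeros is kept, and its mean comes out as NaN because mean() is applied to an empty set of values.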

score 1 · Accepted Answer · answered Mar 23 '15 at 00:30

If you have dplyr installed, the following code seems to solve your problem.

library(dplyr)

set.seed(1234)

a <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),c(1,1,1,1,0,0,0,1,0,0),c(0,0,1,0,1,0,0,1,0,1),c(0,0,0,1,0,0,1,1,0,0),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
b <- data.frame(c("fj","a","fj","a","fj","fj","fj","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
c <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
u <- list(a,b,c)
u <- lapply(u, setNames, nm = c('g1','dummy1','dummy2','dummy3','contin1','contin2','contin3'))


rates <- lapply(u, function(x)
  x %>% 
    mutate( contin1_ = ifelse(dummy1==1, contin1, NA) ) %>%
    mutate( contin2_ = ifelse(dummy2==1, contin2, NA) ) %>%
    mutate( contin3_ = ifelse(dummy3==1, contin3, NA) ) %>%
    group_by(g1) %>%
    summarize( 
              V1 = mean(dummy1, na.rm=TRUE),
              V2 = mean(dummy2, na.rm=TRUE),
              V3 = mean(dummy3, na.rm=TRUE),
              mean1 = mean(contin1_, na.rm=TRUE),
              mean2 = mean(contin2_, na.rm=TRUE),
              mean3 = mean(contin3_, na.rm=TRUE)
               )
)

print(rates[[1]])

Which gives me this:

Source: local data frame [3 x 7]

  g1   V1  V2  V3     mean1 mean2 mean3
1  a 0.50 0.0 0.5 128.00000   NaN    17
2 fj 0.75 0.5 0.0  94.66667    64   NaN
3  g 0.25 0.5 0.5  54.00000    57   146

The numbers I get look about right, and the missing values (shown as NaN) are in all the right places. Unfortunately, your example is not fully reproducible because you did not set a seed for the random number generator, so my runif gives different values than yours.
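The NaN entries in the output above arise because mean() with na.rm=TRUE is applied to a vector that has no non-missing values left. If NA is preferred, as in the desired output in the question, a small base R sketch of the conversion could look like this. The vector means is illustrative and simply copies the row for group "a" from the output above.

```r
# The mean of an empty set of values is NaN, not NA
m <- mean(c(NA_real_, NA_real_), na.rm = TRUE)
print(m)

# Row for group "a" from the output above: mean1, mean2, mean3
means <- c(mean1 = 128, mean2 = NaN, mean3 = 17)

# Replace NaN with NA; is.nan() is TRUE only for NaN, not for NA
means[is.nan(means)] <- NA
print(means)
```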

Thanks. Works perfectly. Spaced on not setting the seed :) — IamWill, Mar 23 '15 at 00:40

akrun · Answer 2 · 2015-03-23T05:26:27.703

Another option would be to change the format from 'wide' to 'long' and convert back to 'wide' after getting the 'mean' values. For multiple value columns, this is now possible with melt and dcast from the devel version of data.table, i.e. v1.9.5. It can be installed from here. (I used the same dataset as in @akhmed's post.)

We can melt the datasets within the list ('u') by specifying the column indices of the 'dummy' and 'contin' columns in measure.vars as a list. Then we get the mean of the 'dummy' and 'contin' columns grouped by 'g1' and 'variable' (created by melt), and dcast from long to wide by specifying value.var as 'dummyMean' and 'continMean'.

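 library(data.table)  # melt/dcast with multiple value columns need v1.9.5 or later

 res <-  lapply(u, function(x) {
   # long format: each row holds one 'dummy' value and its matching 'contin' value
   x1 <- melt(setDT(x), measure.vars=list(2:4,5:7),
                        value.name=c('dummy', 'contin'))
   # mean of each dummy, and mean of 'contin' only where dummy == 1
   x2 <- x1[, list(dummyMean = mean(dummy, na.rm=TRUE),
             continMean = mean(contin[dummy==1], na.rm=TRUE)), 
                           by=list(g1, variable)]

   # back to wide format: one row per g1
   dcast(x2, g1~variable, value.var=c('dummyMean', 'continMean'))})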

 res[[1]]
 #   g1 1_dummyMean 2_dummyMean 3_dummyMean 1_continMean 2_continMean
 #1:  a        0.50         0.0         0.5    128.00000          NaN
 #2: fj        0.75         0.5         0.0     94.66667           64
 #3:  g        0.25         0.5         0.5     54.00000           57
 #    3_continMean
 #1:           17
 #2:          NaN
 #3:          146
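For readers without that data.table version, the same wide-to-long idea can be sketched in base R. This is only an illustration under stated assumptions, not the code from this answer: it uses the values printed for u[[1]] in the question instead of the seeded data, and the names x, long, hit, dummy_mean, and contin_mean are made up for the example.

```r
# Values copied from the printed u[[1]] in the question
x <- data.frame(
  g1      = c("fj","fj","fj","a","fj","a","g","g","g","g"),
  dummy1  = c(1,1,1,1,0,0,0,1,0,0),
  dummy2  = c(0,0,1,0,1,0,0,1,0,1),
  dummy3  = c(0,0,0,1,0,0,1,1,0,0),
  contin1 = c(199,91,147,181,14,178,116,48,60,95),
  contin2 = c(18,158,67,105,16,14,97,31,21,145),
  contin3 = c(61,28,190,22,156,98,30,144,112,199)
)
n <- nrow(x)

# Long format: stack the three dummy/contin pairs on top of each other.
# g1 is stored as a factor so that subsetting rows keeps every group level.
long <- data.frame(
  g1       = factor(rep(x$g1, times = 3)),
  variable = rep(1:3, each = n),
  dummy    = unlist(x[paste0("dummy", 1:3)], use.names = FALSE),
  contin   = unlist(x[paste0("contin", 1:3)], use.names = FALSE)
)

# Group-by-variable matrix of dummy means
dummy_mean <- with(long, tapply(dummy, list(g1, variable), mean))

# Means of contin restricted to rows where dummy == 1;
# tapply fills empty group/variable cells with NA
hit <- long[long$dummy == 1, ]
contin_mean <- with(hit, tapply(contin, list(g1, variable), mean))

print(dummy_mean)
print(contin_mean)
```

Here tapply with two grouping variables returns the wide layout directly (groups in rows, variable index in columns), so no separate reshaping step is needed.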

Or a base R option using Map. I created the functions 'fdummy' and 'fcontin' to subset the 'dummy' and 'contin' columns. Loop through 'u' with lapply, use Map to pair up the corresponding 'dummy' and 'contin' columns, and, grouped by the 'g1' column, use tapply to get the mean of each 'dummy' column and the mean of each 'contin' column where 'dummy==1'. Finally, cbind the results.

 fdummy <- function(x) x[grep('dummy', names(x))]
 fcontin <- function(x) x[grep('contin', names(x))]
 res2 <- lapply(u, function(x) {
        do.call(cbind.data.frame,
           Map(function(x,y,z) {
                 # g1 is character in R >= 4.0; as a factor, empty groups stay as NA
                 z <- factor(z)
                 cbind(tapply(x,z, FUN=mean), 
                       tapply(y[x==1],z[x==1], FUN=mean))}, 
               fdummy(x), fcontin(x), x['g1']))})


lapply(res2, setNames, c(rbind(paste0('dummyMean', 1:3), 
                    paste0('continMean',1:3))))[[1]]
#    dummyMean1 continMean1 dummyMean2 continMean2 dummyMean3 continMean3
#a        0.50   128.00000        0.0          NA        0.5          17
#fj       0.75    94.66667        0.5          64        0.0          NA
#g        0.25    54.00000        0.5          57        0.5         146
