I have a list of three data frames and would like to generate another list of three data frames whose rows consist of each of the values of a grouping variable (g1) and the means of six variables by the g1 variable. The twist is that I would like to calculate the means for the three continuous variables only when the value of the corresponding dummy variable equals 1.
Reproducible example:
a <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),c(1,1,1,1,0,0,0,1,0,0),c(0,0,1,0,1,0,0,1,0,1),c(0,0,0,1,0,0,1,1,0,0),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
b <- data.frame(c("fj","a","fj","a","fj","fj","fj","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
c <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
u <- list(a,b,c)
u <- lapply(u, setNames, nm = c('g1','dummy1','dummy2','dummy3','contin1','contin2','contin3'))
u[[1]]
> u
[[1]]
g1 dummy1 dummy2 dummy3 contin1 contin2 contin3
1 fj 1 0 0 199 18 61
2 fj 1 0 0 91 158 28
3 fj 1 1 0 147 67 190
4 a 1 0 1 181 105 22
5 fj 0 1 0 14 16 156
6 a 0 0 0 178 14 98
7 g 0 0 1 116 97 30
8 g 1 1 1 48 31 144
9 g 0 0 0 60 21 112
10 g 0 1 0 95 145 199
I would like to calculate the mean of contin1 only when dummy1 = 1, mean of contin2 only when dummy2 = 1, and mean of contin3 only when dummy3 = 1
The output I WANT for the first list:
> rates
[[1]]
x[, 1] V1 V2 V3 x[, 1] x[, 6] x[, 1] x[, 7] x[, 1] x[, 8]
1 a 0.50 0.0 0.5 a 181 a NA a 22
2 fj 0.75 0.5 0.0 fj 145.67 fj 41.5 fj NA
3 g 0.25 0.5 0.5 g 48 g 88 g 87
What I have tried:
rates <- lapply(u, function(x) {
cbind(aggregate(cbind(x[,2],x[,3],x[,4]) ~ x[,1], FUN = mean, na.action = NULL),
aggregate(x[,6] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,2] == 1)),
aggregate(x[,7] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,3] == 1)),
aggregate(x[,8] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,4] == 1)))
})
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 3, 2
I understand that this error is coming from cbind, because cbind fails whenever you try to cbind objects with different numbers of rows. (The column x[, 6] has three rows whereas x[, 7] and x[, 8] have two.) I guess I was hoping that there was some way for aggregate to keep one row per grouping variable, which would mean that I would have the same number of rows and the cbind would work. Perhaps this is not possible per the R documentation?: "Rows with missing values in any of the by variables will be omitted from the result."
I have cafefully read the documentation for aggregate. The following two posts address similar issues but not using different subsets of the data to calculate the means.
R: Calculate means for subset of a group and Means from a list of data frames in R
Any suggestions would be hugely appreciated.