Suppose I'd like to calculate the mean, standard deviation, and n (number of non-NA values) for columns "dat_1" to "dat_3" of the following dataframe, grouped by the factors "fac_1" and "fac_2", such that separate dataframes for each statistic (or function) can be accessed from the result
set.seed(1)
df <- data.frame("fac_1" = c(rep("a", 5), rep("b", 4)),
"fac_2" = c("x", "x", "y","y", "y", "y", "x", "x", "x"),
"dat_1" = c(floor(runif(3, 0, 10)), NA, floor(runif(5, 0, 10))),
"dat_2" = floor(runif(9, 10, 20)),
"dat_3" = floor(runif(9, 20, 30)))
This can be achieved one function at a time using plyr
, as such
ddply(.data = df, .variables = .(df$fac_1, df$fac_2), .fun = function(x) { colMeans(x[, 3:5], na.rm = T) } ) # mean
ddply(.data = df, .variables = .(df$fac_1, df$fac_2), .fun = function(x) { psych::SD(x[, 3:5], na.rm = T) } ) # standrd deviation -- note uses SD from the 'psych' package
ddply(.data = df, .variables = .(df$fac_1, df$fac_2), .fun = function(x) { colSums(!is.na(x[, 3:5])) } ) # number of non-NA values
but this becomes cumbersome when using multiple functions, especially when factors and columns of interest must be changed. I'm wondering if there's an alternative (a one-liner, perhaps).
Aggregate works
aggregate( x = df[, c(3:5)], by = df[, c(1,2)], FUN = function(x) c(n = length( !is.na(x) ), mean = mean(x, na.rm = T), sd = sd(x, na.rm = T) ) )
but 'disaggregating' the result (into separate dataframes for each statistic) becomes awkward.
Recently I've come across dplyr
. The following seems to work
df %>% group_by(fac_1, fac_2) %>% summarise_each(funs(n = length( !is.na(.) ), mean(., na.rm = TRUE), sd(., na.rm = TRUE) )) # using dplyr
however I'd like to be able to paste factors into group_by()
, and I've not found a way to do so.
Any help or ideas? Thanks