
Suppose I'd like to calculate the mean, standard deviation, and n (number of non-NA values) for columns "dat_1" to "dat_3" of the following data frame, grouped by the factors "fac_1" and "fac_2", such that a separate data frame for each statistic (or function) can be accessed from the result:

set.seed(1)
df <- data.frame("fac_1" = c(rep("a", 5), rep("b", 4)),
             "fac_2" = c("x", "x", "y","y", "y", "y", "x", "x", "x"),
             "dat_1" = c(floor(runif(3, 0, 10)), NA, floor(runif(5, 0, 10))),
             "dat_2" = floor(runif(9, 10, 20)),
             "dat_3" = floor(runif(9, 20, 30)))

This can be achieved one function at a time using plyr, like so:

library(plyr)
ddply(.data = df, .variables = .(fac_1, fac_2), .fun = function(x) { colMeans(x[, 3:5], na.rm = TRUE) } ) # mean
ddply(.data = df, .variables = .(fac_1, fac_2), .fun = function(x) { psych::SD(x[, 3:5], na.rm = TRUE) } ) # standard deviation -- note: uses SD() from the 'psych' package
ddply(.data = df, .variables = .(fac_1, fac_2), .fun = function(x) { colSums(!is.na(x[, 3:5])) } ) # number of non-NA values

but this becomes cumbersome when using multiple functions, especially when factors and columns of interest must be changed. I'm wondering if there's an alternative (a one-liner, perhaps).

aggregate() works:

aggregate( x = df[, 3:5], by = df[, 1:2], FUN = function(x) c(n = sum( !is.na(x) ), mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE) ) ) # sum(), not length(), so that NAs are excluded from n

but 'disaggregating' the result (into separate dataframes for each statistic) becomes awkward.
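One possible way to do that disaggregation in base R is sketched below. It relies on aggregate() storing each summarised column as a matrix with one named column per statistic; the object names agg, stats and by_stat are made up for this illustration.

```r
set.seed(1)
df <- data.frame("fac_1" = c(rep("a", 5), rep("b", 4)),
                 "fac_2" = c("x", "x", "y", "y", "y", "y", "x", "x", "x"),
                 "dat_1" = c(floor(runif(3, 0, 10)), NA, floor(runif(5, 0, 10))),
                 "dat_2" = floor(runif(9, 10, 20)),
                 "dat_3" = floor(runif(9, 20, 30)))

# Each of dat_1..dat_3 in 'agg' is a matrix with columns "n", "mean", "sd"
agg <- aggregate(x = df[, 3:5], by = df[, 1:2],
                 FUN = function(x) c(n    = sum(!is.na(x)),
                                     mean = mean(x, na.rm = TRUE),
                                     sd   = sd(x, na.rm = TRUE)))

# Pull one statistic at a time out of every matrix column
# ('stats' and 'by_stat' are illustrative names)
stats <- c("n", "mean", "sd")
by_stat <- lapply(setNames(stats, stats), function(s) {
  cbind(agg[, 1:2],
        as.data.frame(lapply(agg[, 3:5], function(m) m[, s])))
})

by_stat$mean  # data frame of group means; likewise by_stat$n and by_stat$sd
```

The result is a named list, so each statistic's data frame can be accessed by name, and changing the grouping or data columns only means changing the column indices passed to aggregate().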

Recently I've come across dplyr. The following seems to work:

library(dplyr)
df %>% group_by(fac_1, fac_2) %>% summarise_each(funs(n = sum( !is.na(.) ), mean(., na.rm = TRUE), sd(., na.rm = TRUE) )) # using dplyr; sum() counts only non-NA values

However, I'd like to be able to paste factors into group_by(), and I've not found a way to do so.

Any help or ideas? Thanks

pyg
  • I'm not clear what you mean by "paste factors into `group_by`". Maybe you need `group_by_` for standard evaluation? – aosmith Aug 10 '16 at 02:19
  • Yes, please give an example of the thing that you want to do that you can't do. I would also re-write the question a bit to focus on the actual question -- which is the part that pertains to `group_by` and factors -- not the calculation of multiple statistics as you already have multiple working solutions to that problem. – Hack-R Aug 10 '16 at 02:24
  • A valid point @Hack-R. The clarity of the issue does leave much to be desired. It appears the solution has been offered for similar issues elsewhere ([this, for example](http://stackoverflow.com/questions/21208801/group-by-multiple-columns-in-dplyr-using-string-vector-input) ). To revise the question such that it complements the solution given might mean it becomes a duplicate. Please advise whether it's worth editing still, or otherwise. – pyg Aug 10 '16 at 10:25

1 Answer


Passing vectors or lists to dplyr functions can be tricky (see this vignette). In short, it involves appending an underscore to the function name, which selects the standard evaluation version of that function, and then passing a vector or list to its .dots argument.

factorsToSummarise <-
  c('fac_1', 'fac_2')

   # extra underscore
        # |
df %>%  # v
  group_by_(.dots = factorsToSummarise) %>% 
  summarise_each(funs(n = sum( !is.na(.) ), 
                      mean(., na.rm = TRUE), 
                      sd(., na.rm = TRUE) 
  )) # using dplyr
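
The underscore verbs (group_by_(), along with summarise_each() and funs()) have since been deprecated in dplyr. A rough equivalent for dplyr 1.0.0 or later, assuming across() and all_of() are available, might look like the sketch below; the name result is made up for this illustration.

```r
library(dplyr)

set.seed(1)
df <- data.frame("fac_1" = c(rep("a", 5), rep("b", 4)),
                 "fac_2" = c("x", "x", "y", "y", "y", "y", "x", "x", "x"),
                 "dat_1" = c(floor(runif(3, 0, 10)), NA, floor(runif(5, 0, 10))),
                 "dat_2" = floor(runif(9, 10, 20)),
                 "dat_3" = floor(runif(9, 20, 30)))

factorsToSummarise <- c("fac_1", "fac_2")

# 'result' is an illustrative name; assumes dplyr >= 1.0.0
result <- df %>%
  group_by(across(all_of(factorsToSummarise))) %>%  # group by a character vector
  summarise(
    across(everything(),                            # all non-grouping columns
           list(n    = ~ sum(!is.na(.x)),
                mean = ~ mean(.x, na.rm = TRUE),
                sd   = ~ sd(.x, na.rm = TRUE))),
    .groups = "drop"
  )

# One data frame per statistic, selected by column-name suffix
result %>% select(all_of(factorsToSummarise), ends_with("_mean"))
```

The summary columns are named by joining the original column name and the function name with an underscore (dat_1_n, dat_1_mean, and so on), which is what makes the ends_with() selection possible.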
Mir Henglin
  • Thank you @Mir (and @aosmith, above) for drawing my attention to `group_by_()` in conjunction with `.dots`. It affords a little more flexibility in `dplyr` and hence is perfect for what I was after – pyg Aug 10 '16 at 09:54