
Using dplyr, I'd like to summarize [sic] by a variable that I can vary (e.g. in a loop or apply-style command).

Typing the names in directly works fine:

library(dplyr)
ChickWeight %>% group_by( Chick, Diet ) %>% summarise( mw = mean( weight ) )

But group_by wasn't written to take a character vector, so passing in column names held in variables is harder.

v <- "Diet"
ChickWeight %>% group_by( c( "Chick", v ) ) %>% summarise( mw = mean( weight ) )
## Error

I'll post one solution, but I'm curious to see how others have solved this.

Cœur
Ari B. Friedman
  • :-) `summarize [sic]` +1 – Tyler Rinker Feb 08 '15 at 00:26
  • Just do `group_by_( c( "Chick", v ) )` instead of `group_by( c( "Chick", v ) )`.... – David Arenburg Feb 08 '15 at 00:43
  • @Ari If you use US spelling, why do you use `summarise` in code? – Konrad Rudolph Feb 08 '15 at 00:44
  • And of course, if it wasn't possible with `dplyr`, you could also just do it easily with `data.table` :) as in `library(data.table) ; as.data.table(ChickWeight)[, .(mw = mean(weight)), c("Chick", v)]` – David Arenburg Feb 08 '15 at 00:59
  • @KonradRudolph One more function call wrapped around things? In deference to Hadley's native ways? Out of habit from older Hadley packages? Dunno. :-) – Ari B. Friedman Feb 08 '15 at 01:09
  • @KonradRudolph - I use `summarise` as well, mainly because there is no `summarize_each`. One less thing I have to remember. – Rich Scriven Feb 08 '15 at 01:31
  • @Richard The use of UK English in Hadley’s library is an unfortunate (= bad) decision. APIs should be uniform, not personalised. I favour British spelling in all my writing, yet I adhere to the uniform, established, US spelling in my code. It’s very annoying and breaks all kinds of principles of API design when other code breaks that rule (there’s a reason non-English programming languages are usually seen as a failed experiment). As such, I strongly recommend adhering to the US spelling (and the lack of `summarize_each` is probably an oversight). – Konrad Rudolph Feb 08 '15 at 10:50
  • @KonradRudolph, there's [an issue](https://github.com/hadley/dplyr/issues/891) on github asking for a `summarize_each` alias. – talat Feb 08 '15 at 14:10
  • @docendodiscimus There are actually at least two pull requests to fix it – I almost added a third this morning, before finding the other two. – Konrad Rudolph Feb 08 '15 at 14:52

2 Answers


The underscore functions of dplyr could be useful for that:

ChickWeight %>% group_by_( "Chick", v )  %>% summarise( mw = mean( weight ) )

From the new features in dplyr 0.3:

You can now program with dplyr – every function that uses non-standard evaluation (NSE) also has a standard evaluation (SE) twin that ends in _. For example, the SE version of filter() is called filter_(). The SE version of each function has similar arguments, but they must be explicitly “quoted”.
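
For readers on a newer dplyr: the underscore verbs were deprecated starting with dplyr 0.7.0, so the following is a hedged sketch of two alternatives rather than part of the original answer. It assumes dplyr 1.0.0 or later (for `across()`, `all_of()` and the `.groups` argument) and that the rlang package is installed; the names `grp_cols`, `by_across` and `by_syms` are made up for illustration.

```r
library(dplyr)

v <- "Diet"
grp_cols <- c("Chick", v)  # grouping columns held as plain strings

# Option 1 (assumes dplyr >= 1.0.0): select the grouping columns by name
by_across <- ChickWeight %>%
  group_by(across(all_of(grp_cols))) %>%
  summarise(mw = mean(weight), .groups = "drop")

# Option 2 (assumes rlang is available): turn strings into symbols and splice them in
by_syms <- ChickWeight %>%
  group_by(!!!rlang::syms(grp_cols)) %>%
  summarise(mw = mean(weight), .groups = "drop")
```

Both variants should produce one row per Chick/Diet combination; which one reads better is largely a matter of taste.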

NicE

Here's one solution and how I arrived at it.

What does group_by expect?

> group_by
function (x, ..., add = FALSE) 
{
    new_groups <- named_dots(...)

Down the rabbit hole:

> dplyr:::named_dots
function (...) 
{
    auto_name(dots(...))
}
<environment: namespace:dplyr>
> dplyr:::auto_name
function (x) 
{
    names(x) <- auto_names(x)
    x
}
<environment: namespace:dplyr>
> dplyr:::auto_names
function (x) 
{
    nms <- names2(x)
    missing <- nms == ""
    if (all(!missing)) 
        return(nms)
    deparse2 <- function(x) paste(deparse(x, 500L), collapse = "")
    defaults <- vapply(x[missing], deparse2, character(1), USE.NAMES = FALSE)
    nms[missing] <- defaults
    nms
}
<environment: namespace:dplyr>
> dplyr:::names2
function (x) 
{
    names(x) %||% rep("", length(x))
}
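
The takeaway from the internals above seems to be that the arguments are captured unevaluated and named by deparsing them. A small base-R sketch can make that visible. Note that `capture_dots()` is a hypothetical stand-in for dplyr's internal `dots()`, not its actual implementation; `deparse2()` is copied from the listing above.

```r
# Hypothetical stand-in for dplyr:::dots(): return the arguments unevaluated
capture_dots <- function(...) eval(substitute(alist(...)))

# Same helper as in auto_names() above
deparse2 <- function(x) paste(deparse(x, 500L), collapse = "")

v <- "Diet"
exprs <- capture_dots(c("Chick", v))

length(exprs)  # a single expression, not two column names
auto_nms <- vapply(exprs, deparse2, character(1), USE.NAMES = FALSE)
auto_nms       # the deparsed text of the call: c("Chick", v)
```

So `c( "Chick", v )` reaches group_by as one unevaluated call rather than as two column names, which is why the character-vector attempt does not group by Chick and Diet.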

Using that information, how do we go about crafting a solution?

# Naive solution fails:
ChickWeight %>% do.call( group_by, list( Chick, Diet ) ) %>% summarise( mw = mean( weight ) )

# Slightly cleverer:
do.call( group_by, list( x = ChickWeight, Chick, Diet, add = FALSE ) ) %>% summarise( mw = mean( weight ) )
## But still fails with,
## Error in do.call(group_by, list(x = ChickWeight, Chick, Diet, add = FALSE)) : object 'Chick' not found

The solution lies in quoting the arguments so their evaluation is delayed until they're in the environment that includes the x tbl:

do.call( group_by, list( x = ChickWeight, quote(Chick), quote(Diet), add = FALSE ) ) %>% summarise( mw = mean( weight ) )
## Bingo!
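v <- "Diet"
# as.name(v) turns the string into the symbol Diet; substitute( a, list( a = v ) ) would insert the string "Diet" itself
do.call( group_by, list( x = ChickWeight, quote(Chick), as.name(v), add = FALSE ) ) %>% summarise( mw = mean( weight ) )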
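
The same quoting idea can be stretched to a whole character vector by converting each string to a symbol. This is only a sketch: `grp_cols` and `grouped` are illustrative names, the data frame is passed positionally so the call does not depend on the name of group_by's first argument, and `group_vars()` assumes a reasonably recent dplyr.

```r
library(dplyr)

v <- "Diet"
grp_cols <- c("Chick", v)

# Convert each column name to a symbol so group_by sees Chick and Diet, not strings
grouped <- do.call(group_by, c(list(ChickWeight), lapply(grp_cols, as.name)))

group_vars(grouped)  # the grouping columns actually registered
grouped %>% summarise(mw = mean(weight))
```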
Ari B. Friedman