Can different parts of dplyr::summarize() be computed conditionally?

Question

Is it possible to have conditional statements operate on different parts of dplyr::summarize()?

Suppose I am summarizing the iris data and want to include the mean of Sepal.Length only when requested. I could do something like:

library(dplyr)

data(iris)
include_length <- TRUE
if (include_length) {
  iris %>% 
    group_by(Species) %>%
    summarize(mean_sepal_width = mean(Sepal.Width), mean_sepal_length = mean(Sepal.Length))
} else {
  iris %>% 
    group_by(Species) %>%
    summarize(mean_sepal_width = mean(Sepal.Width))
}

But is there a way to implement the conditional within the pipeline so that it does not need to be duplicated?

alistaire · Accepted Answer · 2016-11-01T23:28:54.917

You can use the .dots parameter of dplyr's SE (standard evaluation) functions to build the summary expressions programmatically, e.g.

library(dplyr)

take_means <- function(include_length){
    iris %>% 
        group_by(Species) %>%
        summarize_(mean_sepal_width = ~mean(Sepal.Width), 
                   .dots = if(include_length){
                       list(mean_sepal_length = ~mean(Sepal.Length))
                   })
}

take_means(TRUE)
#> # A tibble: 3 × 3
#>      Species mean_sepal_width mean_sepal_length
#>       <fctr>            <dbl>             <dbl>
#> 1     setosa            3.428             5.006
#> 2 versicolor            2.770             5.936
#> 3  virginica            2.974             6.588

take_means(FALSE)
#> # A tibble: 3 × 2
#>      Species mean_sepal_width
#>       <fctr>            <dbl>
#> 1     setosa            3.428
#> 2 versicolor            2.770
#> 3  virginica            2.974
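
For readers on newer dplyr versions: the underscored verbs such as summarize_() and the .dots argument have been deprecated since dplyr 0.7.0. A minimal sketch of the same idea with tidy evaluation follows, assuming a current dplyr; the variable name extra is only illustrative. The optional expressions are captured with quos() and spliced into summarize() with !!!, so an empty quos() adds no columns.

```r
library(dplyr)

take_means <- function(include_length) {
  # Optional summaries, captured unevaluated; an empty quos() splices to nothing
  extra <- if (include_length) {
    quos(mean_sepal_length = mean(Sepal.Length))
  } else {
    quos()
  }

  iris %>%
    group_by(Species) %>%
    summarize(mean_sepal_width = mean(Sepal.Width), !!!extra)
}

names(take_means(TRUE))
names(take_means(FALSE))
```

With several optional columns, each one can go into the same quos() call, so the summaries that are always wanted are still written only once.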

@Frank Yeah, I couldn't decide which made it more obvious how it works, but I agree it makes more sense as a function, so I edited. — alistaire, Nov 01 '16 at 23:30
This should do the trick, though I will probably end up using the earlier edit. Thanks! — Kevin Burnham, Nov 02 '16 at 02:19

score 3 · Answer 2 · answered Nov 01 '16 at 20:25

In base R, you can do c(x, if (d) y) and depending on the value of d, you'll get the second element included or excluded from the result. x and y can be vectors or lists.
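
A small illustration of why this works, using made-up vectors x and y: an if without an else returns NULL when the condition is FALSE, and c() drops NULL.

```r
x <- c(a = 1, b = 2)
y <- c(c = 3)

with_y    <- c(x, if (TRUE) y)   # if() returns y, so all three elements are kept
without_y <- c(x, if (FALSE) y)  # if() returns NULL, which c() drops

names(with_y)
names(without_y)

# The same holds for lists
length(c(list(mw = 1), if (FALSE) list(ml = 2)))
```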

This trick works in data.table, since the return expression is a list:

library(data.table)
f = function(d) data.table(iris)[, c(
  .(mw = mean(Sepal.Width)), 
  if(d) .(ml = mean(Sepal.Length))
), by=Species]

Usage

> f(TRUE)
      Species    mw    ml
1:     setosa 3.428 5.006
2: versicolor 2.770 5.936
3:  virginica 2.974 6.588
> f(FALSE)
      Species    mw
1:     setosa 3.428
2: versicolor 2.770
3:  virginica 2.974

Inside DT[...] the .() is shorthand for list(). You may have reasons for wanting to hit the pipe, but I think this option is worth considering.

score 1 · Answer 3 · edited May 23 '17 at 12:06


It's about conditional evaluation with magrittr.

A possible solution:

library(magrittr)
library(dplyr)

data(iris)
include_length = TRUE

iris %>%
  group_by(Species) %>%
  { if (include_length) {summarize(., mean_sepal_width = mean(Sepal.Width), mean_sepal_length = mean(Sepal.Length))} 
    else {summarize(., mean_sepal_width = mean(Sepal.Width))} 
  }

nevrome · answered Nov 01 '16 at 20:21 · edited by Community

I think they wanted to avoid writing `mean_sepal_width = mean(Sepal.Width)` twice. – Frank Nov 01 '16 at 20:22
This is better, but in my actual use case the summarize statement computes 7 or 8 different variables, so ideally I would avoid that duplication, though that may be impossible. – Kevin Burnham Nov 01 '16 at 20:25

score 0 · Answer 4 · answered Nov 01 '16 at 21:02

A slightly hackish way:

iris %>%
    group_by(Species) %>%
    summarise(mean_sepal_length=if(include_length) mean(Sepal.Length) else NA,
              mean_sepal_width=mean(Sepal.Width))

This creates a column with the mean if include_length == TRUE, and NA otherwise. You can remove the NA column in post-processing if this is a problem.
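
One possible way to do that post-processing, sketched under the assumption of dplyr 1.0.0 or later (where select() accepts where()): drop every column that is entirely NA. Note that this would also remove any other column that happens to be all NA.

```r
library(dplyr)

include_length <- FALSE

result <- iris %>%
  group_by(Species) %>%
  summarise(
    # NA_real_ keeps the placeholder column numeric
    mean_sepal_length = if (include_length) mean(Sepal.Length) else NA_real_,
    mean_sepal_width = mean(Sepal.Width)
  ) %>%
  # Keep only columns that contain at least one non-NA value
  select(where(~ !all(is.na(.x))))

names(result)
```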
