4

I am trying to write a function in R that summarizes a data frame according to grouping variables. The grouping variables are given as a list and passed to group_by_at, and I would like to parametrize them.

What I am doing now is this:

library(tidyverse)

d = tribble(
  ~foo, ~bar, ~baz,
  1, 2, 3,
  1, 3, 5,
  4, 5, 6,
  4, 5, 1
)

sum_fun <- function(df, group_vars, sum_var) {
  sum_var = enquo(sum_var)
  return(
    df %>% 
      group_by_at(.vars = group_vars) %>% 
      summarize(sum(!! sum_var))
  )
}

d %>% sum_fun(group_vars = c("foo", "bar"), baz)

However, I would like to call the function like so:

d %>% sum_fun(group_vars = c(foo, bar), baz)

That is, the grouping variables should not be evaluated in the call, but inside the function. How would I go about rewriting the function to enable that?

I have tried using enquo just like for the summary variable, and then replacing group_vars with !! group_vars, but it leads to this error:

Error in !group_vars : invalid argument type

Using group_by(!!!group_vars) yields:

Column `c(foo, bar)` must be length 2 (the number of rows) or one, not 4 

What would be the proper way to rewrite the function?

Tung
slhck

3 Answers

9

I'd just use vars() to do the quoting. Here is an example using the mtcars dataset:

library(tidyverse)

sum_fun <- function(.data, .summary_var, .group_vars) {
  summary_var <- enquo(.summary_var)

  .data %>%
    group_by_at(.group_vars) %>%
    summarise(mean = mean(!!summary_var))
}

sum_fun(mtcars, disp, .group_vars = vars(cyl, am))
#> # A tibble: 6 x 3
#> # Groups:   cyl [?]
#>     cyl    am  mean
#>   <dbl> <dbl> <dbl>
#> 1     4     0 136. 
#> 2     4     1  93.6
#> 3     6     0 205. 
#> 4     6     1 155  
#> 5     8     0 358. 
#> 6     8     1 326

You can also replace .group_vars with ... (dot-dot-dot) and forward it to group_by_at():

sum_fun2 <- function(.data, .summary_var, ...) {
  summary_var <- enquo(.summary_var)

  .data %>%
    group_by_at(...) %>%  # Forward `...`
    summarise(mean = mean(!!summary_var))
}

sum_fun2(mtcars, disp, vars(cyl, am))
#> # A tibble: 6 x 3
#> # Groups:   cyl [?]
#>     cyl    am  mean
#>   <dbl> <dbl> <dbl>
#> 1     4     0 136. 
#> 2     4     1  93.6
#> 3     6     0 205. 
#> 4     6     1 155  
#> 5     8     0 358. 
#> 6     8     1 326

If you would rather pass the grouping columns as c(cyl, am), capture the ... with enquos():

sum_fun3 <- function(.data, .summary_var, ...) {
  summary_var <- enquo(.summary_var)

  group_var <- enquos(...)
  print(group_var)

  .data %>%
      group_by_at(group_var) %>% 
      summarise(mean = mean(!!summary_var))
}

sum_fun3(mtcars, disp, c(cyl, am))
#> [[1]]
#> <quosure>
#>   expr: ^c(cyl, am)
#>   env:  global
#> 
#> # A tibble: 6 x 3
#> # Groups:   cyl [?]
#>     cyl    am  mean
#>   <dbl> <dbl> <dbl>
#> 1     4     0 136. 
#> 2     4     1  93.6
#> 3     6     0 205. 
#> 4     6     1 155  
#> 5     8     0 358. 
#> 6     8     1 326

Edit: to append an additional variable (.addi_var) to .group_vars, concatenate the two vars() lists:

sum_fun4 <- function(.data, .summary_var, .addi_var, .group_vars) {
  summary_var <- enquo(.summary_var)

  .data %>%
    group_by_at(c(.group_vars, .addi_var)) %>%
    summarise(mean = mean(!!summary_var))
}

sum_fun4(mtcars, disp, .addi_var = vars(gear), .group_vars = vars(cyl, am))
#> # A tibble: 10 x 4
#> # Groups:   cyl, am [?]
#>      cyl    am  gear  mean
#>    <dbl> <dbl> <dbl> <dbl>
#>  1     4     0     3 120. 
#>  2     4     0     4 144. 
#>  3     4     1     4  88.9
#>  4     4     1     5 108. 
#>  5     6     0     3 242. 
#>  6     6     0     4 168. 
#>  7     6     1     4 160  
#>  8     6     1     5 145  
#>  9     8     0     3 358. 
#> 10     8     1     5 326

group_by_at() can also take a character vector of column names as input:

sum_fun5 <- function(.data, .summary_var, .addi_var, ...) {

  summary_var <- enquo(.summary_var)
  addi_var    <- enquo(.addi_var)
  group_var   <- enquos(...)

  ### convert quosures to strings for `group_by_at`
  all_group <- purrr::map_chr(c(addi_var, group_var), quo_name)

  .data %>%
    group_by_at(all_group) %>% 
    summarise(mean = mean(!!summary_var))
}

sum_fun5(mtcars, disp, gear, cyl, am)
#> # A tibble: 10 x 4
#> # Groups:   gear, cyl [?]
#>     gear   cyl    am  mean
#>    <dbl> <dbl> <dbl> <dbl>
#>  1     3     4     0 120. 
#>  2     3     6     0 242. 
#>  3     3     8     0 358. 
#>  4     4     4     0 144. 
#>  5     4     4     1  88.9
#>  6     4     6     0 168. 
#>  7     4     6     1 160  
#>  8     5     4     1 108. 
#>  9     5     6     1 145  
#> 10     5     8     1 326

Created on 2018-10-09 by the reprex package (v0.2.1.9000)

Tung
  • That looks much more consistent with the tidyverse – thank you! – slhck Oct 10 '18 at 06:57
  • Quick follow-up: What if in my `sum_fun`, I wanted to have an additional argument named `.additional_var` that gets appended to `.group_vars` in the `group_by_at` call? – slhck Oct 10 '18 at 07:03
  • @slhck: see my edit. We have to do it a little bit differently this time – Tung Oct 10 '18 at 08:42
  • Oh, interesting approach – there are always a thousand ways to do things in R. I just posted my own question–answer pair for another approach that I found: https://stackoverflow.com/questions/52736118/adding-column-names-to-vars-inside-a-dplyr-function/52736119#52736119 – slhck Oct 10 '18 at 08:45
  • @slhck: Nice! Thanks for sharing! – Tung Oct 10 '18 at 08:53
  • I changed the accepted answer for a more "modern" dplyr approach (and gave the points to a new contributor as encouragement). Hope you understand. – slhck Sep 06 '21 at 16:15
3

You could make use of the ellipsis (...). Take the following example:

sum_fun <- function(df, sum_var, ...) {
  sum_var <- substitute(sum_var)
  grps    <- substitute(list(...))[-1L]
  return(
    df %>% 
      group_by_at(.vars = as.character(grps)) %>% 
      summarize(sum(!! sum_var))
  )
}

d %>% sum_fun(baz, foo, bar)

We wrap the additional arguments in list(...) and use non-standard evaluation (substitute) to capture them unevaluated, so that R does not try to look up foo and bar at call time. Dropping the first element removes the list symbol and leaves only the variable names. Since group_by_at accepts a character vector of column names (or numeric positions), we convert the names to character and the function is evaluated as we would expect.
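To see what the substitute step produces on its own, here is a minimal base R sketch. The helper name capture_names is hypothetical and exists only for illustration; it is not part of the answer's function.

```r
# Hypothetical helper (illustration only): return the names of the
# arguments passed through `...` without evaluating them.
capture_names <- function(...) {
  # substitute() yields the unevaluated call list(foo, bar);
  # [-1L] drops the leading `list` symbol, leaving only the arguments.
  grps <- substitute(list(...))[-1L]
  # as.character() turns each captured symbol into a string.
  as.character(grps)
}

captured <- capture_names(foo, bar)
print(captured)
#> [1] "foo" "bar"
```

Note that foo and bar do not need to exist anywhere, because they are never evaluated; only their names are used.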

> d %>% sum_fun(baz, foo, bar)
# A tibble: 3 x 3
# Groups:   foo [?]
    foo   bar `sum(baz)`
  <dbl> <dbl>      <dbl>
1     1     2          3
2     1     3          5
3     4     5          7

If you do not want to supply grouping variables as any number of additional arguments, then you can of course use a named argument:

sum_fun <- function(df, sum_var, grps) {
  sum_var <- enquo(sum_var)
  grps    <- as.list(substitute(grps))[-1L]
  return(
    df %>% 
      group_by_at(.vars = as.character(grps)) %>% 
      summarize(sum(!! sum_var))
  )
}

sum_fun(mtcars, sum_var = hp, grps = c(cyl, gear))

The reason I use substitute is that it makes it easy to split the expression c(cyl, gear) into its components. There might be a way to do this with rlang, but I have not dug into that package so far.

Martin Schmelzer
  • Thanks for the answer. Problem is, I have other arguments in the function, or I may have additional ones. I guess I can move around the arguments if that is the only option. Is there any difference between `substitute` and `enquo`? – slhck Oct 09 '18 at 13:08
  • Updated my answer. – Martin Schmelzer Oct 09 '18 at 13:41
3

You can rewrite the function using a combination of dplyr::group_by(), dplyr::across(), and curly curly embracing {{. This works with dplyr version 1.0.0 and greater.

I've edited the original example and code for clarity.

library(tidyverse)

my_data <- tribble(
  ~foo, ~bar, ~baz,
   "A",  "B",    3,
   "A",  "C",    5,
   "D",  "E",    6,
   "D",  "E",    1
)

sum_fun <- function(.data, group, sum_var) {
    .data %>% 
      group_by(across({{ group }})) %>% 
      summarize("sum_{{sum_var}}" := sum({{ sum_var }}))
}

sum_fun(my_data, group = c(foo, bar), sum_var = baz)
#> `summarise()` has grouped output by 'foo'. You can override using the `.groups` argument.
#> # A tibble: 3 x 3
#> # Groups:   foo [2]
#>   foo   bar   sum_baz
#>   <chr> <chr>   <dbl>
#> 1 A     B           3
#> 2 A     C           5
#> 3 D     E           7

Created on 2021-09-06 by the reprex package (v2.0.0)
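Because across() uses tidyselect, the same function should also accept column names given as strings, as in the question's original call, if they are wrapped in all_of(). The following is a sketch under that assumption (dplyr 1.0.0 or later); the variable name res is only for illustration.

```r
library(dplyr)

my_data <- tibble::tribble(
  ~foo, ~bar, ~baz,
   "A",  "B",    3,
   "A",  "C",    5,
   "D",  "E",    6,
   "D",  "E",    1
)

# Same function as above: `group` is forwarded to across() via {{ }}.
sum_fun <- function(.data, group, sum_var) {
  .data %>%
    group_by(across({{ group }})) %>%
    summarize("sum_{{sum_var}}" := sum({{ sum_var }}))
}

# Grouping columns supplied as a character vector, wrapped in all_of().
res <- sum_fun(my_data, group = all_of(c("foo", "bar")), sum_var = baz)
print(res)
```

This gives the same three groups as the call with group = c(foo, bar), so one function can serve both calling styles.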

  • This is so much cleaner! It would be good if you could mention in which version of dplyr this works. (My follow-up Q&A could also benefit from this simplification: https://stackoverflow.com/a/52736119/435093) – slhck Sep 06 '21 at 16:14
  • Edited! Answered your follow-up with this simplification too. – Michael McCarthy Sep 06 '21 at 17:40