
I am trying to understand the expected output of dplyr::group_by() when used together with dplyr::all_of(). My understanding is that dplyr::all_of() should convert a character vector of variable names to bare names so that group_by() can use them, but this doesn't appear to happen.

Below, I generate some fake data, pass different objects to group_by() with(out) all_of() and calculate the number of observations in each group. In the example, passing a single bare column name without dplyr::all_of() produces the correct output: one row per unique value of the column. However, passing character vectors or using dplyr::all_of() produces incorrect output: one row regardless of the number of values in a column.

What is expected when using all_of and how might I alternatively pass a character vector to group_by to process as a vector of bare names?

library(dplyr)

# Create a data.frame with 2 variables, each with 2 unique values.
# var (length 20) is recycled to match bar (length 40), giving 40 rows.
df <- data.frame(var = rep(c("a", "b"), 10),
                 bar = rep(c(1, 2), 20))

# Output 1: 2x2 tibble - GOOD
df %>% group_by(var) %>% summarize(n = n())

# Output 2: 1x2 tibble - BAD
foo <- "var"
df %>% group_by(all_of(foo)) %>% summarize(n = n())

# Output 3: 1x2 tibble
df %>% group_by("var") %>% summarize(n = n())

# Output 4: Error in_var not found - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
  df %>%
    group_by(in_var) %>%
    summarize(n = n())
})

# Output 5: list of length 2 where
# each element is a 1x2 tibble - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
  df %>%
    group_by(all_of(in_var)) %>%
    summarize(n = n())
})
socialscientist
  • an option for your second question is using `group_by` with `across` if you're using dplyr >= 1.0.0 – EJJ May 14 '21 at 19:18
  • I think you want `df %>% group_by(across(all_of(foo))) %>% summarize(n = n())` with the latest versions of `dplyr` – MrFlick May 14 '21 at 19:20

2 Answers


We can use group_by_at

lapply(foo2, function(in_var) df %>% 
      group_by_at(all_of(in_var)) %>% 
      summarise(n = n()))

-output

#[[1]]
# A tibble: 2 x 2
#  var       n
#* <chr> <int>
#1 a        20
#2 b        20

#[[2]]
# A tibble: 2 x 2
#    bar     n
#* <dbl> <int>
#1     1    20
#2     2    20

As across replaces some of the functionality of group_by_at, we can use it instead with all_of:

lapply(foo2, function(in_var) df %>% 
      group_by(across(all_of(in_var))) %>% 
      summarise(n = n()))

Or convert the string to a symbol and unquote it with !!

lapply(foo2, function(in_var) df %>% 
      group_by(!! rlang::sym(in_var)) %>% 
      summarise(n = n()))

Or use map

library(purrr)
map(foo2, ~ df %>%
              group_by(!! rlang::sym(.x)) %>%
              summarise(n = n()))

Or, instead of group_by() followed by summarise(), use count()

map(foo2, ~ df %>%
              count(across(all_of(.x))))
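
One more option, described in the dplyr programming vignette, is the .data pronoun, which looks a column up by its string name inside a data-masking verb. The following is only a sketch: the data frame is a small stand-in for the one in the question, and `res` is a name introduced here for illustration.

```r
library(dplyr)

# Stand-in data: 20 rows, two columns with two unique values each.
df <- data.frame(var = rep(c("a", "b"), 10),
                 bar = rep(c(1, 2), 10))
foo2 <- list("var", "bar")

# .data[[in_var]] subsets the data frame by the string held in in_var,
# so each call groups by the named column rather than by a constant.
res <- lapply(foo2, function(in_var) {
  df %>%
    group_by(.data[[in_var]]) %>%
    summarise(n = n())
})

sapply(res, nrow)
```

This suits the case of one column name per call; for a character vector holding several names at once, across(all_of(...)) is the closer fit.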
                       
akrun
  • This is great, thanks. I think the issue was in my understanding of why `across()` is necessary. Still not clear to me why `group_by` needs `across()` to use `all_of` appropriately. I know this is a scoping issue. If there is a good link to read for how different functions scope in `dplyr`, would be welcome. – socialscientist May 14 '21 at 19:29
  • @user3614648 you may check the `vignette` of dplyr which will have information regarding those new functionalities – akrun May 14 '21 at 19:34
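
On the question of why across() is needed: as far as the dplyr documentation describes it, group_by() evaluates its arguments with data masking (each argument is an expression computed against the columns), whereas all_of() is a tidyselect helper meant for selection contexts such as select() or the .cols argument of across(). A minimal sketch of the difference, using a small stand-in data frame and the illustrative names `masked` and `selected`:

```r
library(dplyr)

# Stand-in data: 20 rows, two columns with two unique values each.
df <- data.frame(var = rep(c("a", "b"), 10),
                 bar = rep(c(1, 2), 10))
foo <- "var"

# Data masking: the string "var" is evaluated as a value, so every row
# receives the same constant and there is only one group.
masked <- df %>% group_by("var") %>% summarise(n = n())

# Tidy selection through across(): the string is looked up as a column name.
selected <- df %>% group_by(across(all_of(foo))) %>% summarise(n = n())

nrow(masked)    # 1
nrow(selected)  # 2
```

In other words, across() supplies the selection context that all_of() expects, and hands the selected columns back to group_by() as grouping expressions.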

To add to @akrun's answer showing multiple ways to achieve the desired output: my understanding of all_of() is that it is a selection helper for variables stored as a character vector, for use in dplyr functions, and that it uses vctrs underneath. any_of() is a less strict version of all_of() that is convenient in some use cases. Reading ?tidyselect::all_of is helpful. This page is also helpful for keeping up with changes in dplyr and tidy evaluation: https://dplyr.tidyverse.org/articles/programming.html.
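
To make the strict versus lenient distinction concrete, here is a small sketch. The data frame and the column name "not_a_column" are made up for illustration, and the tryCatch() wrapper is only there so the snippet runs to completion.

```r
library(dplyr)

# Made-up data and a character vector that names one real and one missing column.
df <- data.frame(var = c("a", "b"), bar = c(1, 2))
cols <- c("var", "not_a_column")

# any_of() silently ignores names that are not in the data.
lenient <- df %>% select(any_of(cols))
names(lenient)  # "var"

# all_of() raises an error when any requested name is missing.
strict <- tryCatch(
  df %>% select(all_of(cols)),
  error = function(e) "error"
)
strict  # "error"
```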

The scoped dplyr verbs have been superseded by across(), based on decisions by the developers at RStudio. See ?group_by_at or the documentation for the other *_if, *_at, and *_all variants. So I guess it really depends on which version of dplyr you are using in your workflow and what works best for you.

This SO post also gives context of changes in solutions over time with passing characters into dplyr functions, and there's probably more posts out there.

EJJ