Dynamically select multiple columns whose names are stored as variables

Question

I would like a function that accepts a tibble and a character vector naming a variable number of columns in that tibble, and performs operations such as group_by on those columns.

Here is an example that does it for 0, 1, or 2 columns:

library(tidyverse)

ex = crossing(abc=LETTERS[1:3], xyz=LETTERS[24:26]) %>% mutate(n = row_number())

group_flexibly = function(tbl, group_by_cols=character(0)) {
  if (length(group_by_cols)==0) {
    tbl %>%
      summarize(.groups='keep', mean_n = mean(n))
  } else if (length(group_by_cols)==1) {
    tbl %>%
      group_by(!!as.name(group_by_cols[1])) %>%
      summarize(.groups='keep', mean_n=mean(n))
  } else if (length(group_by_cols)==2) {
    tbl %>%
      group_by(!!as.name(group_by_cols[1]), !!as.name(group_by_cols[2])) %>%
      summarize(.groups='keep', mean_n=mean(n))
  }
}

group_flexibly(ex)
group_flexibly(ex, 'abc')
group_flexibly(ex, 'xyz')
group_flexibly(ex, c('abc','xyz'))

Output is as desired:

> group_flexibly(ex)
# A tibble: 1 × 1
  mean_n
   <dbl>
1      5
> group_flexibly(ex, 'abc')
# A tibble: 3 × 2
# Groups:   abc [3]
  abc   mean_n
  <chr>  <dbl>
1 A          2
2 B          5
3 C          8
> group_flexibly(ex, 'xyz')
# A tibble: 3 × 2
# Groups:   xyz [3]
  xyz   mean_n
  <chr>  <dbl>
1 X          4
2 Y          5
3 Z          6
> group_flexibly(ex, c('abc','xyz'))
# A tibble: 9 × 3
# Groups:   abc, xyz [9]
  abc   xyz   mean_n
  <chr> <chr>  <dbl>
1 A     X          1
2 A     Y          2
3 A     Z          3
4 B     X          4
5 B     Y          5
6 B     Z          6
7 C     X          7
8 C     Y          8
9 C     Z          9

So far so good. Now, how can I write a function that does this for a character vector of arbitrary length?

Here are two things that do not work:

group_by_cols = c('abc','xyz')
ex %>% group_by(!!as.name(group_by_cols)) %>% summarize(.groups='keep', mean_n=mean(n))
ex %>% group_by({{group_by_cols}}) %>% summarize(.groups='keep', mean_n=mean(n))

Problems encountered so far:

!!as.name(group_by_cols) only uses group_by_cols[1] and ignores the rest of the vector.
{{group_by_cols}} throws an error if length(group_by_cols) != 1.
Popular StackOverflow discussions such as this do not address a need for the length of the vector of column names to be variable.

joran · Accepted Answer · 2023-05-25T14:11:03.337

You're looking for across() and all_of():

group_flexibly <- function(tbl,grp_cols = character(0)){
  tbl |>
    group_by(across(all_of(grp_cols))) |>
    summarise(mean_n = mean(n),.groups = 'keep')
}

The default value of character(0) handles the case of not providing any value to grp_cols.
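
As a quick check, here is a minimal, self-contained sketch that rebuilds the example tibble from the question and calls this function with zero, one, and two grouping columns. It assumes dplyr 1.0.0 or later (for across()) and R 4.1 or later (for the native |> pipe); the res_* names are only for illustration.

```r
library(dplyr)
library(tidyr)  # for crossing()

# Example data from the question: 9 rows, n = 1..9
ex <- crossing(abc = LETTERS[1:3], xyz = LETTERS[24:26]) |>
  mutate(n = row_number())

group_flexibly <- function(tbl, grp_cols = character(0)) {
  tbl |>
    group_by(across(all_of(grp_cols))) |>
    summarise(mean_n = mean(n), .groups = "keep")
}

res_none <- group_flexibly(ex)                   # no grouping columns
res_one  <- group_flexibly(ex, "abc")            # one grouping column
res_two  <- group_flexibly(ex, c("abc", "xyz"))  # two grouping columns
```

Because all_of() takes the whole character vector, the same call should work for any number of column names, as long as every name exists in tbl.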

I recently learned that pick() is somewhat preferred over across() here. The difference shows up when grp_cols is a named vector: across() will create new columns using those names, whereas pick(all_of(grp_cols)) or the .by argument suggested in a comment would both error.
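
For reference, a sketch of the pick() variant described above. It assumes dplyr 1.1.0 or later, where pick() was introduced; the function name group_flexibly_pick is hypothetical, chosen only to keep it apart from the across() version.

```r
library(dplyr)  # pick() assumes dplyr >= 1.1.0
library(tidyr)  # for crossing()

# Example data from the question: 9 rows, n = 1..9
ex <- crossing(abc = LETTERS[1:3], xyz = LETTERS[24:26]) |>
  mutate(n = row_number())

# Hypothetical name; same shape as the across() version, with pick() swapped in
group_flexibly_pick <- function(tbl, grp_cols = character(0)) {
  tbl |>
    group_by(pick(all_of(grp_cols))) |>
    summarise(mean_n = mean(n), .groups = "keep")
}

res_pick_one <- group_flexibly_pick(ex, "xyz")
res_pick_two <- group_flexibly_pick(ex, c("abc", "xyz"))
```

With an unnamed character vector, this should give the same groups and means as the across() version.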

edited May 25 '23 at 14:11

answered May 24 '23 at 19:55

Also works well with the new-ish `.by` argument, `tbl |> summarize(mean_n = mean(n), .by = all_of(group_by_cols))` – Gregor Thomas May 24 '23 at 20:00
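
A sketch of the .by approach from this comment, wrapped in a function. It assumes dplyr 1.1.0 or later, where .by was added; the function name summarise_flexibly is hypothetical. Unlike the group_by() versions, .by groups only for the duration of the call, so the result comes back ungrouped.

```r
library(dplyr)  # .by assumes dplyr >= 1.1.0
library(tidyr)  # for crossing()

# Example data from the question: 9 rows, n = 1..9
ex <- crossing(abc = LETTERS[1:3], xyz = LETTERS[24:26]) |>
  mutate(n = row_number())

# Hypothetical name; per-operation grouping via .by instead of group_by()
summarise_flexibly <- function(tbl, grp_cols = character(0)) {
  tbl |>
    summarise(mean_n = mean(n), .by = all_of(grp_cols))
}

res_by_one <- summarise_flexibly(ex, "abc")
res_by_two <- summarise_flexibly(ex, c("abc", "xyz"))
```

If later steps rely on the "# Groups:" structure shown in the question's output, the group_by() versions are the closer match.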
