Getting "Can't subset columns that don't exist." when the column does exist with dplyr

Question

I have the variable 'county' listed as a column, but when I try to aggregate over it using group_by() with across() in this manner:

testing4 <- testing2 %>%
        group_by(across(-c(county, population))) %>%
        summarise(pop=sum(population))

it gives me:

Error: Problem with `mutate()` input `..1`.
x Can't subset columns that don't exist.
x Column `county` doesn't exist.
Input `..1` is `across(-c(county, population))`.
i The error occurred in group 1: year = 1980, state = "AK", stfips = 2, 
county = 2900.
Run `rlang::last_error()` to see where the error occurred.

However, when I do

testing3 <- testing2 %>%
        group_by(year, state, stfips, race) %>%
        summarise(pop = sum(population))

it runs fine.

Edit: Someone asked for dput(head(testing2))

dput(head(testing2))
structure(list(year = c(1980L, 1980L, 1980L, 1980L, 1980L, 1980L
), state = c("AK", "AK", "AK", "AL", "AL", "AL"), stfips = c(2L, 
2L, 2L, 1L, 1L, 1L), county = c(2900L, 2900L, 2900L, 1001L, 1001L, 
1001L), race = c(1L, 2L, 3L, 1L, 2L, 3L), population = c(318054L, 
13960L, 72666L, 24876L, 7193L, 148L)), row.names = c(NA, -6L), groups = 
structure(list(
year = c(1980L, 1980L), state = c("AK", "AL"), stfips = 2:1, 
county = c(2900L, 1001L), .rows = structure(list(1:3, 4:6), ptype = 
integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = 1:2, class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

Seems like the `across()` might be causing the problem. You're not using that in the working version. It would be easier to help if you provided a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output. Feel free to use a built-in dataset for your example rather than your own data. — MrFlick, Sep 06 '20 at 23:36
When I do `iris %>% group_by(across(-c(Sepal.Length, -Petal.Length)))` it seems to work. So maybe it really is your particular data.frame. Share a `dput()`, just the first few rows would be fine. `dput(head(testing2))` — MrFlick, Sep 06 '20 at 23:40
I see the dataset you are working with is already grouped. Have you tried ungrouping before doing more work? — aosmith, Sep 06 '20 at 23:58
See `https://github.com/tidyverse/dplyr/issues/5253` - using `across()` with `group_by()` is designed to fail if data is already grouped by one of those vars. Error message not particularly helpful though. — Ritchie Sacramento, Sep 06 '20 at 23:59

Corrado · Answer 1 · 2020-09-07T07:09:31.180

Welcome here.

When you group a tibble, the dplyr verbs applied afterwards work group by group, and what they see for each group is the data without the grouping variables (?group_by). In fact, you can access that temporary per-group data inside those verbs using cur_data() (?cur_data).

So, step 1: when you use across() on a grouped tibble (like yours), it works, for each group, on the data without the grouping variables. across() without the .fns argument (default NULL; see ?across) returns the selected variables unmodified, starting from that input data, which in your case does not contain the current grouping variables. Hence, you cannot refer to a grouping variable inside across() on a grouped tibble.
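
A minimal sketch of step 1, using a hypothetical toy tibble (`toy`, `g`, `x` and `y` are made-up names, not from the question). The grouping column `g` is a character vector, so `sum()` would fail on it; the call succeeds because `across(everything())` only selects the non-grouping columns:

```r
library(dplyr)

# Hypothetical toy data: `g` is character, so sum(g) would raise an error
# if across(everything()) ever selected the grouping column.
toy <- tibble(
  g = c("a", "a", "b"),
  x = c(1, 2, 3),
  y = c(10, 20, 30)
)

by_g <- toy %>%
  group_by(g) %>%
  summarise(across(everything(), sum))  # selects x and y only, never g

names(by_g)
```

Here `g` comes back only as the group key, while `x` and `y` are the columns that `across()` actually processed.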

But, step 2: group_by() overrides any existing grouping by default (.add = FALSE; see the examples in ?group_by).
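
To illustrate step 2, here is a small sketch on the built-in `mtcars` data (chosen only for convenience; it is not part of the question). A second `group_by()` replaces the earlier grouping unless `.add = TRUE` is passed:

```r
library(dplyr)

# A second group_by() replaces the existing grouping by default.
regrouped <- mtcars %>%
  group_by(cyl) %>%
  group_by(gear)
group_vars(regrouped)

# With .add = TRUE the new variable is appended to the existing grouping.
added <- mtcars %>%
  group_by(cyl) %>%
  group_by(gear, .add = TRUE)
group_vars(added)
```

The first call leaves only `gear` as a grouping variable; the second keeps `cyl` and adds `gear`.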

Combining the two: you don't need to list a variable you want to exclude if it is already a grouping variable. To regroup a tibble by other variables, exclude only the additional ones you do not want to group by; the previous grouping variables are already excluded at the time the new groups are computed. When the new group_by() (which evaluates across() by group, i.e. without the grouping variables) joins the results, it returns the whole tibble, grouped by everything except the previous grouping variables and the ones you have just excluded.

A problem arises if you want to regroup a grouped tibble, excluding some other variables but keeping (a subset of) the current grouping variables. In that case, list the grouping variables you want to keep outside the call to across(), directly in the call to group_by(). Unlike across(), a bare column name in group_by() does not compute anything, so it is not evaluated on the per-group pieces of the tibble (which lack the old grouping variables). That way the last group_by() creates a tibble grouped by "all the variables that are not old grouping variables and are not listed in the new exclusions, plus the old grouping variables named outside across()".

Here is a runnable (reproducible) example:

# install.packages("tidyverse")
# install.packages("palmerpenguins")

library(tidyverse)
library(palmerpenguins)

penguins
#> # A tibble: 344 x 8
#>    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#>    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
#>  1 Adelie  Torge…           39.1          18.7              181        3750
#>  2 Adelie  Torge…           39.5          17.4              186        3800
#>  3 Adelie  Torge…           40.3          18                195        3250
#>  4 Adelie  Torge…           NA            NA                 NA          NA
#>  5 Adelie  Torge…           36.7          19.3              193        3450
#>  6 Adelie  Torge…           39.3          20.6              190        3650
#>  7 Adelie  Torge…           38.9          17.8              181        3625
#>  8 Adelie  Torge…           39.2          19.6              195        4675
#>  9 Adelie  Torge…           34.1          18.1              193        3475
#> 10 Adelie  Torge…           42            20.2              190        4250
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

penguins %>% 
    group_by(species, island) %>% 
    group_by(across(-c(
        starts_with("bill"),
        starts_with("flipper"),
        starts_with("body")
    ))) # species and island are already excluded
#> # A tibble: 344 x 8
#> # Groups:   sex, year [9]
#>    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#>    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
#>  1 Adelie  Torge…           39.1          18.7              181        3750
#>  2 Adelie  Torge…           39.5          17.4              186        3800
#>  3 Adelie  Torge…           40.3          18                195        3250
#>  4 Adelie  Torge…           NA            NA                 NA          NA
#>  5 Adelie  Torge…           36.7          19.3              193        3450
#>  6 Adelie  Torge…           39.3          20.6              190        3650
#>  7 Adelie  Torge…           38.9          17.8              181        3625
#>  8 Adelie  Torge…           39.2          19.6              195        4675
#>  9 Adelie  Torge…           34.1          18.1              193        3475
#> 10 Adelie  Torge…           42            20.2              190        4250
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>


penguins %>% 
    group_by(species, island) %>% 
    group_by(
        across(-c(
            starts_with("bill"),
            starts_with("flipper"),
            starts_with("body")
        )),
        species # "continue" to use species for grouping
    )
#> # A tibble: 344 x 8
#> # Groups:   sex, year, species [22]
#>    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#>    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
#>  1 Adelie  Torge…           39.1          18.7              181        3750
#>  2 Adelie  Torge…           39.5          17.4              186        3800
#>  3 Adelie  Torge…           40.3          18                195        3250
#>  4 Adelie  Torge…           NA            NA                 NA          NA
#>  5 Adelie  Torge…           36.7          19.3              193        3450
#>  6 Adelie  Torge…           39.3          20.6              190        3650
#>  7 Adelie  Torge…           38.9          17.8              181        3625
#>  8 Adelie  Torge…           39.2          19.6              195        4675
#>  9 Adelie  Torge…           34.1          18.1              193        3475
#> 10 Adelie  Torge…           42            20.2              190        4250
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

Created on 2020-09-07 by the reprex package (v0.3.0)

sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=it_IT.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices datasets  utils     methods   base     
#> 
#> other attached packages:
#>  [1] palmerpenguins_0.1.0 forcats_0.5.0        stringr_1.4.0       
#>  [4] dplyr_1.0.2          purrr_0.3.4          readr_1.3.1         
#>  [7] tidyr_1.1.2          tibble_3.0.3         ggplot2_3.3.2       
#> [10] tidyverse_1.3.0     
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.1.0 xfun_0.16        haven_2.3.1      colorspace_1.4-1
#>  [5] vctrs_0.3.4      generics_0.0.2   htmltools_0.5.0  yaml_2.2.1      
#>  [9] utf8_1.1.4       blob_1.2.1       rlang_0.4.7      pillar_1.4.6    
#> [13] glue_1.4.2       withr_2.2.0      DBI_1.1.0        dbplyr_1.4.4    
#> [17] modelr_0.1.8     readxl_1.3.1     lifecycle_0.2.0  munsell_0.5.0   
#> [21] gtable_0.3.0     cellranger_1.1.0 rvest_0.3.6      evaluate_0.14   
#> [25] knitr_1.29       fansi_0.4.1      highr_0.8        broom_0.7.0     
#> [29] Rcpp_1.0.5       renv_0.12.0      scales_1.1.1     backports_1.1.9 
#> [33] jsonlite_1.7.0   fs_1.5.0         hms_0.5.3        digest_0.6.25   
#> [37] stringi_1.4.6    grid_4.0.2       cli_2.0.2        tools_4.0.2     
#> [41] magrittr_1.5     crayon_1.3.4     pkgconfig_2.0.3  ellipsis_0.3.1  
#> [45] xml2_1.3.2       reprex_0.3.0     lubridate_1.7.9  assertthat_0.2.1
#> [49] rmarkdown_2.3    httr_1.4.2       R6_2.4.1         compiler_4.0.2

score 3 · Answer 2 · answered Sep 07 '20 at 00:19

Looks like inserting an ungroup() as the second step will work. For a minimal comparison, try

testing2 %>% group_by(across(-county))

and

testing2 %>% ungroup() %>% group_by(across(-county))
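
Applied to the full pipeline from the question, the same idea might look like the sketch below. It rebuilds `testing2` from the six rows in the question's `dput()` output (as a plain tibble that is then grouped the same way), and uses `.groups = "drop"` purely so the two results are easy to compare; that argument is an addition, not part of the original code:

```r
library(dplyr)

# First six rows from the question's dput(), regrouped like the original
# (year, state, stfips, county).
testing2 <- tibble(
  year = rep(1980L, 6),
  state = c("AK", "AK", "AK", "AL", "AL", "AL"),
  stfips = c(2L, 2L, 2L, 1L, 1L, 1L),
  county = c(2900L, 2900L, 2900L, 1001L, 1001L, 1001L),
  race = c(1L, 2L, 3L, 1L, 2L, 3L),
  population = c(318054L, 13960L, 72666L, 24876L, 7193L, 148L)
) %>%
  group_by(year, state, stfips, county)

# Drop the old grouping first, then group by everything except
# county and population (i.e. year, state, stfips, race).
testing4 <- testing2 %>%
  ungroup() %>%
  group_by(across(-c(county, population))) %>%
  summarise(pop = sum(population), .groups = "drop")

# The explicit version from the question, for comparison.
testing3 <- testing2 %>%
  group_by(year, state, stfips, race) %>%
  summarise(pop = sum(population), .groups = "drop")

all.equal(testing4, testing3)
```

With only these six rows each state has a single county, so the sums are trivial, but the two pipelines should agree on the full data as well.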

Ben Bolker