Translating filter_all(any_vars()) to filter(across())

Question

On updating my own answer to another thread, I wasn't able to come up with a good solution to replace the last example (see below). The idea is to get all rows where any column contains a certain string, in my example "V".

library(tidyverse)

#get all rows where any column contains 'V'
diamonds %>%
  filter_all(any_vars(grepl('V',.))) %>%
  head
#> # A tibble: 6 x 10
#>   carat cut       color clarity depth table price     x     y     z
#>   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#> 2 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#> 3 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#> 4 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#> 5 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#> 6 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49


# this does naturally not give the desired output! 
diamonds %>%
  filter(across(everything(), ~ grepl('V', .))) %>%
  head
#> # A tibble: 0 x 10

I found a thread where the poster ponders over similar stuff, but applying a similar logic on grepl does not work.

### don't run, this is ugly and does not work
diamonds %>%
  rowwise %>%
  filter(any(grepl("V", across(everything())))) %>%
  head

If I check the dataset, it is `ordered` ie. factor column and BTW, you need `c_across`, but still it would have an issue with the types — akrun, Jul 13 '20 at 22:34
This is very difficult, because the example shows that you want to filter data from _all_ columns when _any_ of them satisfies the condition (i.e. you want a _union_). That's done with `filter_all()` and `any_vars()`. While `filter(across(everything(), ...))` filters out from all columns when all of them satisfy the condition (i.e. this is a _intersection_, quite opposite of what you originaly wanted). Do I understand it correctly? — , Jul 13 '20 at 22:56
@PetrKajzar That's why I am asking this question! If you have an idea, please post it :) — tjebo, Jul 13 '20 at 22:59
@Tjebo Wow, nice question then! OK, I tried to post some idea. — , Jul 13 '20 at 23:06

score 8 · Accepted Answer · 2020-07-14T10:52:23.583

This is very difficult, because the example shows that you want to filter data from all columns when any of them meets the condition (i.e. you want a union). That's done with filter_all() and any_vars().

While filter(across(everything(), ...)) filters out from all columns when all of them meet the condition (i.e. this is a intersection, quite opposite of the previous).

To convert it from intersection to the union (i.e. to get again rows where any of the columns meet the condition), you probably need to check the row sum for that:

diamonds %>%
   filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)

It will sum all the TRUEs that appear in the row, i.e. if there is at least one value meeting the condition, that row sum will be > 0 and will be shown.

I'm sorry for across() is not the very first child of filter(), but it's at least some idea how to do that. :-)

Evaluation:

Using @TimTeaFan's method to check that:

 identical(
     {diamonds %>%
         filter_all(any_vars(grepl('V',.)))
     }, 
     {diamonds %>%
         filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)
     }
 )
 #> [1] TRUE

Benchmark:

As per our discussion under TimTeaFan's answer, here is a comparison, surprisingly, all solutions have a similar time:

library(tidyverse)
microbenchmark::microbenchmark(
  filter_all = {diamonds %>%
      filter_all(any_vars(grepl('V',.)))}, 
  purrr_reduce = {diamonds %>%
      filter(across(everything(), ~ grepl('V', .)) %>% purrr::reduce(`|`))},
  base_reduce = {diamonds %>%
      filter(across(everything(), ~ grepl('V', .)) %>% Reduce(`|`, .))},
  rowsums = {diamonds %>%
      filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)},
  times = 100L,
  check = "identical"
)
#> Unit: milliseconds
#>          expr      min       lq     mean   median       uq      max neval
#>    filter_all 295.7235 302.1311 309.6455 305.0491 310.0335 449.3619   100
#>  purrr_reduce 297.8220 302.4411 310.2829 306.2929 312.2278 461.0194   100
#>   base_reduce 298.5033 303.6170 309.4147 306.1839 312.3518 409.5273   100
#>       rowsums 295.3863 301.0281 307.8517 305.3142 309.4793 372.8867   100

^{Created on 2020-07-14 by the reprex package (v0.3.0)}

You might add a note about `na.rm = TRUE` in the rowSums function to account for missing values. This tripped me up. — Jeff Parker, Feb 11 '21 at 23:05

TimTeaFan · Answer 2 · 2020-07-13T22:52:29.787

This is the equivalent to the filter_all call you posted. However, @akrun is totally correct to point out, that it should be converted to character first. Nevertheless, this also holds true for your filter_all statement.

The idea is to use across(everything(), ~ grepl('V', .)) to get the whole data.frame transformed into columns of TRUE and FALSE regarding grepl('V', .). However, filter needs a vector, or a data.frame with one column so we transform it by using reduce(|). It combines the first two columns with | then the result of this call with the third column and so on, until the original data.frame has one column with TRUE and FALSE which can then be used to filter the rows.

library(ggplot2)
library(dplyr)

diamonds %>%
  filter(across(everything(), ~ grepl('V', .)) %>% purrr::reduce(`|`)) %>% 
  head
#> # A tibble: 6 x 10
#>   carat cut       color clarity depth table price     x     y     z
#>   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#> 2 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#> 3 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#> 4 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#> 5 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#> 6 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49

identical({diamonds %>%
            filter_all(any_vars(grepl('V',.)))}, 
          {diamonds %>%
            filter(across(everything(), ~ grepl('V', .)) %>% purrr::reduce(`|`))
            })
#> [1] TRUE

^{Created on 2020-07-14 by the reprex package (v0.3.0)}

Man, it took me a while to recognize the backticks as backticks! That's a real cool use of reduce and very clever!! However, I find @petrkajzar's solution with rowSums very elegant (no need for call to another package, and somewhat shorter in code, and kind of immediately intuitive, but I haven't benchmarked it!) — tjebo, Jul 14 '20 at 06:46
I like the rowSums approach too, it’s more concise and it should also be faster. — TimTeaFan, Jul 14 '20 at 07:35
TimTeaFan, what a nice use of reduce, good approach! @Tjebo I have tried to benchmark that, see my updated answer. Suprisingly, none of the solutions is significantly faster than the others! :-) — , Jul 14 '20 at 10:38
@PetrKajzar yet, your solution is still fastest! I think this may become more visible with lot bigger data :) — tjebo, Jul 14 '20 at 15:15

score 1 · Answer 3 · answered Jul 13 '20 at 22:39

Some of the columns were ordered and it will affect with c_across. Instead, if we convert to character class and then do the grepl it should work

library(dplyr)
library(ggplot2)
diamonds %>%
    head %>% 
    mutate(across(where(is.factor), as.character)) %>% 
    rowwise %>% 
    filter(any(grepl("V", c_across(where(is.character)))))
# A tibble: 3 x 10
# Rowwise: 
#  carat cut       color clarity depth table price     x     y     z
#  <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#1 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#2 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#3 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Thanks Akrun! Not sure I like this change in this case... :( I find filter_all somehow much more concise. Maybe worth an issue on github — tjebo, Jul 13 '20 at 22:52
@Tjebo I think this is under development and at least 2 or 3 questions in the recent days have a similar issue. Probably they will fix on create a new feature — akrun, Jul 13 '20 at 22:55

Translating filter_all(any_vars()) to filter(across())

3 Answers3