I'm trying to clean up a sample information sheet that comes from many different groups, so the treatment information I care about may be located in any of several columns. Here's an abstracted example:

sample_info = tribble(
  ~id, ~could_be_here, ~or_here,    ~or_even_in_this_one,
  1,   NA,             "not_me",    "find_me_other_stuff",
  2,   "Extra_Find_Me", NA,         "diff_stuff",
  3,   NA,              "Find_me",  NA,
  4,   NA,              "not_here", "not_here_either"
)

where I would want to find "find_me" 1) case-insensitively, 2) where it could be in any column, and 3) where it could be part of a larger string. I want to create one column that's TRUE or FALSE depending on whether "find_me" was found in any column. How can I do this? (I've thought of uniting all columns and then just running str_detect on that mess, but there must be a less hacky way, right?)

To be clear, I would want a final tibble that's equivalent to sample_info %>% mutate(find_me = c(TRUE, TRUE, TRUE, FALSE)).

I expect that I would want to use something like stringr::str_detect(., regex('find_me', ignore_case = T)) and pmap_lgl(any(c(...) <insert logic check>)) like in the similar cases linked below, but I'm not sure how to put them together into a mutate-compatible statement.

Things I've looked through:
Row-wise operation to see if any columns are in any other list

R: How to ignore case when using str_detect?

in R, check if string appears in row of dataframe (in any column)

GenesRus

4 Answers


One dplyr and purrr option could be:

sample_info %>%
 mutate(find_me = pmap_lgl(across(-id), ~ any(str_detect(c(...), regex("find_me", ignore_case = TRUE)), na.rm = TRUE)))

     id could_be_here or_here  or_even_in_this_one find_me
  <dbl> <chr>         <chr>    <chr>               <lgl>  
1     1 <NA>          not_me   find_me_other_stuff TRUE   
2     2 Extra_Find_Me <NA>     diff_stuff          TRUE   
3     3 <NA>          Find_me  <NA>                TRUE   
4     4 <NA>          not_here not_here_either     FALSE

Or with just using dplyr:

sample_info %>%
 rowwise() %>%
 mutate(find_me = any(str_detect(c_across(-id), regex("find_me", ignore_case = TRUE)), na.rm = TRUE))
tmfmnk

I hope I understood what you have in mind. This is how I find all find_mes across multiple columns:

library(dplyr)
library(purrr)
library(stringr)

sample_info = tribble(
  ~id, ~could_be_here, ~or_here,    ~or_even_in_this_one,
  1,   NA,             "not_me",    "find_me_other_stuff",
  2,   "Extra_Find_Me", NA,         "diff_stuff",
  3,   NA,              "Find_me",  NA,
  4,   NA,              "not_here", "not_here_either"
)

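# .cols is spelled out as everything(); per the comments below, relying on
# the default did not work with dplyr 1.0.4
sample_info %>%
  mutate(find_me_exist = if_any(everything(), ~ str_detect(.x, regex("find_me", ignore_case = TRUE))))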

# A tibble: 4 x 5
     id could_be_here or_here  or_even_in_this_one find_me_exist
  <dbl> <chr>         <chr>    <chr>               <lgl>        
1     1 NA            not_me   find_me             TRUE         
2     2 Extra_Find_me NA       diff_stuff          TRUE         
3     3 NA            find_Me  NA                  TRUE         
4     4 NA            not_here not_here_either     FALSE

Sorry I had to edit my code so that it is not case sensitive.

Anoushiravan R
  • Could there have been a typo? This is producing a column of FALSEs for me. – GenesRus Mar 23 '21 at 02:26
  • @GenesRus really?! I checked it again right now and the output is the exact same thing I put above. I also copied my codes from here to my R script. Check it again there might have been a mistake. – Anoushiravan R Mar 23 '21 at 08:45
  • Maybe it's version-dependent? I just ran everything in the code block here in a new session to be sure and it's still outputting all FALSEs. My R is 4.0.4. Here are my package versions: stringr_1.4.0, purrr_0.3.4, dplyr_1.0.4 Maybe we've found a bug? – GenesRus Mar 23 '21 at 15:30
  • My dplyr version is 1.0.5 and everything else in my libraries is the exact same versions that you have. Please update your dplyr and let me know. Apart from that I have honestly no explanation for this. It's a bit weird. – Anoushiravan R Mar 23 '21 at 15:39
  • I ran the codes again and got 3 TRUEs and 1 FALSE just like before. – Anoushiravan R Mar 23 '21 at 15:40
  • Yep, that fixed it! For whatever reason, it doesn't work with dplyr_1.0.4. I suspect it is related to this note in the changelog: "The .cols= argument of if_any() and if_all() defaults to everything()" implying my eval wasn't looking at all columns (presumably just id or possibly none of them?), though it could also be related to the bugs they fixed in `across` since `if_any` uses the "same predicate function" as `across` according to the documentation. Glad we found the explanation and thanks for your answer! – GenesRus Mar 23 '21 at 15:57
  • Oh, we finally unraveled the mystery! I'm glad it was solved. Over the past months I've learned to keep at least my tidyverse packages up to date, because I have sometimes come across pretty surprising results across different versions with no explanation. You're welcome, that was my pleasure. – Anoushiravan R Mar 23 '21 at 16:01
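
The version dependence discussed in the comments above can be checked before relying on the default columns of if_any(). This is only a minimal sketch: the 1.0.5 threshold is taken from the comments, and the variable names are illustrative.

```r
# Installed dplyr version (packageVersion() is part of base R's utils package)
dplyr_version <- utils::packageVersion("dplyr")

# Assumption from the comment thread: the everything() default for .cols
# arrived in dplyr 1.0.5
has_everything_default <- dplyr_version >= "1.0.5"

if (!has_everything_default) {
  message("dplyr ", dplyr_version, " is older than 1.0.5; pass .cols explicitly.")
}
```

Passing the columns explicitly, for example with everything() or -id, avoids depending on the default at all.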

This is the typical use case for dplyr::if_any(): if any of the selected columns has a match, the new column is TRUE. Use regex() with the argument ignore_case = TRUE for a case-insensitive match.

library(dplyr)
library(stringr)

sample_info |>
    mutate(find_me = if_any(-id, \(x) str_detect(x, regex("find_me", ignore_case = TRUE))))

# A tibble: 4 × 5
     id could_be_here or_here  or_even_in_this_one find_me
  <dbl> <chr>         <chr>    <chr>               <lgl>  
1     1 NA            not_me   find_me_other_stuff TRUE   
2     2 Extra_Find_Me NA       diff_stuff          TRUE   
3     3 NA            Find_me  NA                  TRUE   
4     4 NA            not_here not_here_either     NA     
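
Note that row 4 comes back as NA rather than the FALSE the question asks for: str_detect() returns NA for missing cells, and FALSE combined with NA stays NA. One possible way around this, as a sketch that assumes a row with no match and some missing cells should count as FALSE, is to wrap the result in coalesce(). The `result` name is only for illustration.

```r
library(dplyr)
library(stringr)

sample_info <- tribble(
  ~id, ~could_be_here,  ~or_here,   ~or_even_in_this_one,
  1,   NA,              "not_me",   "find_me_other_stuff",
  2,   "Extra_Find_Me", NA,         "diff_stuff",
  3,   NA,              "Find_me",  NA,
  4,   NA,              "not_here", "not_here_either"
)

result <- sample_info |>
  mutate(find_me = coalesce(
    # NA where no column matched and at least one cell was missing
    if_any(-id, \(x) str_detect(x, regex("find_me", ignore_case = TRUE))),
    FALSE  # assumption: treat "no match found" as FALSE
  ))
```

Rows where a match exists are unaffected, since TRUE combined with NA is still TRUE.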
GuedesBF

In case you did want to try the hacky way, your idea of using unite does actually work:

library(tidyr)  # unite() comes from tidyr

sample_info %>% unite(new, remove = FALSE) %>%
    mutate(found = str_detect(new, regex("find_me", ignore_case = TRUE))) %>%
    select(-new)
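
A slightly more defensive variant of the same idea is sketched below. By default unite() also pastes in the id column, writes missing values as the literal text "NA", and joins cells with "_", which could in principle produce a match that spans two cells. The choices here (excluding id, na.rm = TRUE, a " | " separator, and the `combined` and `result` names) are assumptions for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(stringr)

sample_info <- tribble(
  ~id, ~could_be_here,  ~or_here,   ~or_even_in_this_one,
  1,   NA,              "not_me",   "find_me_other_stuff",
  2,   "Extra_Find_Me", NA,         "diff_stuff",
  3,   NA,              "Find_me",  NA,
  4,   NA,              "not_here", "not_here_either"
)

result <- sample_info %>%
  # Combine every column except id; drop NAs instead of pasting "NA"
  unite(combined, -id, sep = " | ", remove = FALSE, na.rm = TRUE) %>%
  mutate(found = str_detect(combined, regex("find_me", ignore_case = TRUE))) %>%
  select(-combined)
```

Because the missing cells are dropped before the search, every row gets a plain TRUE or FALSE.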
awaji98