
I have the same question answered here: R - Find all vector elements that contain all strings / patterns - str_detect grep. But the suggested solution takes too long.

I have 73,360 observations with sentences. I want a TRUE return for matches that contain ALL search strings.

sentences <- c("blue green red",
               "blue green yellow",
               "green red  yellow ")
search_terms <- c("blue","red")

pattern <- paste0("(?=.*", search_terms,")", collapse="") 
grepl(pattern, sentences, perl = TRUE)

output:

[1]  TRUE FALSE FALSE
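For context, `collapse=""` concatenates one zero-width lookahead per search term into a single pattern; each `(?=.*term)` makes the regex engine rescan the sentence, which helps explain the slowness on a long vector. A quick check of the pattern actually being built:

```r
search_terms <- c("blue", "red")
# collapse="" glues one lookahead per term into a single regex
pattern <- paste0("(?=.*", search_terms, ")", collapse = "")
pattern
# [1] "(?=.*blue)(?=.*red)"
```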

This gives the right result, but it takes a very long time. Is there a faster way? I tried str_detect and got the same delayed result.

BTW, the "sentences" contain special characters like [],.- but no non-ASCII characters like ñ.

UPDATED: below are my benchmark results using the suggested methods, thanks to @onyambu's input.

Unit: milliseconds
                  expr       min        lq      mean    median        uq      max neval
         OP_solution() 7033.7550 7152.0689 7277.8248 7251.8419 7391.8664 7690.964   100
      map_str_detect() 2239.8715 2292.1271 2357.7432 2348.9975 2397.1758 2774.349   100
 unlist_lapply_fixed()  308.1492  331.9948  345.6262  339.9935  348.9907  586.169   100

Reduce_lapply wins! Thanks @onyambu

Unit: milliseconds
                  expr       min        lq      mean    median        uq       max neval
       Reduce_lapply()  49.02941  53.61291  55.96418  55.31494  56.76109  80.64735   100
 unlist_lapply_fixed() 318.25518 335.58883 362.03831 346.71509 357.97142 566.95738   100
guasi
  • Would every sentence always have 3 colors? Can you change the data structure of `sentences`? – Tim Biegeleisen May 22 '22 at 05:42
  • @TimBiegeleisen, the example was just for simple illustration. I'm searching the character strings describing medical diagnoses, which are more than 73K, each diagnosis has one or two description sentences. – guasi May 22 '22 at 06:59
  • Consider _normalizing_ your data structure such that each color appears in a separate row. – Tim Biegeleisen May 22 '22 at 07:00
  • Check the Reduce answer I provided. It's faster than the rest. I believe, as per the speed, this is the fastest and you should consider it. – Onyambu May 22 '22 at 07:12

2 Answers


EDIT: Another option is to loop over the search terms instead of looping over the sentences:

Reduce("&", lapply(search_terms, grepl, sentences, fixed = TRUE))
[1]  TRUE FALSE FALSE
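To spell out the mechanics (a minimal sketch using the example data from the question): `lapply` runs one fast fixed-string `grepl` per search term, and `Reduce("&", ...)` ANDs the resulting logical vectors element-wise, so a sentence comes out TRUE only if every term matched it.

```r
sentences <- c("blue green red",
               "blue green yellow",
               "green red  yellow ")
search_terms <- c("blue", "red")

# one logical vector per term: does each sentence contain this term?
hits <- lapply(search_terms, grepl, sentences, fixed = TRUE)
hits[[1]]  # "blue": TRUE  TRUE FALSE
hits[[2]]  # "red" : TRUE FALSE  TRUE
# element-wise AND across all terms
Reduce("&", hits)
# [1]  TRUE FALSE FALSE
```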

Benchmark:

Unit: milliseconds
                  expr      min        lq      mean    median        uq       max neval
         OP_solution()  80.6365  81.61575  85.76427  83.20265  87.32975  163.0302   100
      map_str_detect() 546.4681 563.08570 596.26190 571.52185 603.03980 1383.7969   100
 unlist_lapply_fixed()  61.8119  67.49450  71.41485  69.56290  73.77240  104.8399   100
       Reduce_lapply()   3.0604   3.11205   3.406012   3.14535   3.43130   6.3526   100

Note that this is amazingly fast!

OLD POST:

Make use of the all function as shown below:

unlist(lapply(strsplit(sentences, " ", fixed = TRUE), \(x)all(search_terms %in% x)))
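One caveat worth noting (my assumption, given the question mentions punctuation like `[],.-`): this splits on single spaces and then tests exact token membership with `%in%`, so a term glued to punctuation, e.g. `"blue,"`, will not match `"blue"`. A minimal sketch of the splitting step:

```r
sentences <- c("blue green red", "blue, green yellow")
search_terms <- c("blue", "red")

tokens <- strsplit(sentences, " ", fixed = TRUE)
tokens[[2]]
# [1] "blue,"  "green"  "yellow"
# "blue," is a different token than "blue", so %in% misses it
unlist(lapply(tokens, \(x) all(search_terms %in% x)))
# [1]  TRUE FALSE
```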

The benchmark:

OP_solution <- function(){
   pattern <- paste0("(?=.*", search_terms,")", collapse="") 
   grepl(pattern, sentences, perl = TRUE)
}

map_str_detect <- function(){
    purrr::map_lgl(
      .x = sentences,
      .f = ~ all(stringr::str_detect(.x, search_terms))
    )
}

unlist_lapply_fixed <- function() unlist(lapply(strsplit(sentences, " ", fixed = TRUE), \(x)all(search_terms %in% x)))


sentences <- rep(sentences, 10000)
microbenchmark::microbenchmark( OP_solution(),map_str_detect(),
                   unlist_lapply_fixed(), check = 'equal')
Unit: milliseconds
                  expr      min        lq      mean    median        uq      max neval
         OP_solution()  80.5368  81.40265  85.14451  82.73985  86.41345 118.7052   100
      map_str_detect() 542.3555 553.84080 587.15748 566.66570 607.77130 782.5189   100
 unlist_lapply_fixed()  60.4955  66.94420  71.94195  69.30135  72.16735 113.6567   100

    
Onyambu
  • Thank you @onyambu. I've used both suggestions and I can count all the way to four Mississippis!! Although the last one you suggested seems faster. I'm searching 73K observations, each with one or two sentences. Doesn't seem like much, but the delay is noticeable. Using the `|` operator returns results immediately. – guasi May 22 '22 at 06:09
  • @guasi What do you mean by using `|`? – Onyambu May 22 '22 at 06:18
  • The OR operator `|`. If I search for any of the search terms `(blue)|(red)`, results are immediate. If I search for all terms, using any of the suggested methods, results are noticeably delayed. – guasi May 22 '22 at 06:38
  • @guasi Note that using `|` does not give the desired results. That is searching for either blue or red instead of searching for both. Check the `Reduce` method I posted. – Onyambu May 22 '22 at 06:40
  • THANK YOU @onyambu!! You're amazing! The `Reduce_lapply` method is amazingly fast! That's what I'll use. – guasi May 22 '22 at 07:16

You could try a mix of purrr and stringr functions to solve this:

library(tidyverse)

purrr::map_lgl(
  .x = sentences,
  .f = ~ all(stringr::str_detect(.x, search_terms))
)
CourtesyBus
  • Same problem: it takes a long time to run. I'm looking for something faster. – guasi May 22 '22 at 05:34