I have a dataset with a field of interest and a list of strings (several hundred of them).

What I want to do is, for each row of the data, check whether the field contains any of the partial strings.

Essentially, I want to replicate the SQL % wildcard. So, if for example a value is "Game123" and one of my strings is "Ga" I want that to be a match. (But I don't want "OGame" to match "Ga").

I'm hoping to write some statement like this:

df%>%
filter(My_Field contains any one of List_Of_Strings)

How do I fill in that filter statement?

I tried to use the %in% operator but couldn't make it work. I know how to use substrings to check against a single string, but I have a long list of them and need to check all of them.

R filter rows based on multiple partial strings applied to multiple columns: This post is similar to what I'm trying to do, but my list of substrings is 400 plus, so I can't write it all out manually in a grepl statement (I think?)

haley

3 Answers

I guess the problem you're facing is this:

You have a list of what could be called key words (what you call "a list of strings") and a vector/column with text (what you call "a field of interest") and your goal is to filter the vector/column on whether or not any of the key words is present. If that's correct the solution might be this:

Data:

a. List of key words:

keys <- c("how", "why", "what")

b. Dataframe with a vector/column of text:

df <- data.frame(
  text = c("Hi there", "How are you?", "I'm fine.", "So how's work?", "Ah kinda stressful.", "Why?", "Well you know")
)

Solution:

To filter df on keys in text, convert keys into a regex alternation pattern by collapsing the strings with |. Depending on your keys, it may be useful or even necessary to wrap the alternation in word boundary markers (\\b), so that each key matches only as a whole word and not inside another word. Finally, if letter case may differ, add the case-insensitive flag (?i):

library(dplyr)
library(stringr)
df %>%
  filter(str_detect(text, str_c("(?i)\\b(", str_c(keys, collapse = "|"), ")\\b")))
            text
1   How are you?
2 So how's work?
3           Why? 
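
The pattern above matches whole words. For the prefix-only behaviour described in the question (SQL-style 'Ga%'), one possible adaptation is to anchor the alternation at the start of the string with ^ instead of wrapping it in \\b. This is a minimal sketch, not tested against the asker's data: the data frame and the names My_Field and List_Of_Strings are borrowed from the question's pseudocode, and the values are made up. It also assumes the prefixes contain no regex metacharacters (such as . or parentheses); those would need escaping first.

```r
library(dplyr)
library(stringr)

# Hypothetical data mirroring the question's example
df <- data.frame(
  My_Field = c("Game123", "OGame", "Duck", "gamble"),
  stringsAsFactors = FALSE
)
List_Of_Strings <- c("Ga", "Du")

# Builds "^(?:Ga|Du)": matches only when a prefix sits at the very start
pattern <- str_c("^(?:", str_c(List_Of_Strings, collapse = "|"), ")")

# Keep rows whose field starts with any of the prefixes (case-sensitive)
result <- df %>%
  filter(str_detect(My_Field, pattern))

print(result)
```

Because the pattern is built with str_c(), the same code should work unchanged for a list of several hundred prefixes.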
Chris Ruehlemann

Since the question has no dataset or reproducible example, here is one way to implement it with two apply-family calls and a regex anchor. Remember that the regex anchor ^ matches only at the start of the string, so "^Ga" matches "Game123" but not "OGame".

library(dplyr)

MyField <- c("OGame","Game123","Duck","Dugame","Aldubame")

df <- data.frame(MyField)

ListOfStrings <- c("^Ga","^Du") #Notice the use of ^ here

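# TRUE if `entry` matches at least one of the regex `patterns`
match_s <- function(patterns, entry) {
  any(vapply(patterns, grepl, logical(1), x = entry))
}

# vapply() returns a plain logical vector rather than a list column
df$match_string <- vapply(df$MyField, match_s, logical(1), patterns = ListOfStrings)

df %>% filter(match_string)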
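
If the strings are literal prefixes rather than regular expressions, a regex-free variant is also possible. The sketch below swaps in base R's startsWith() (available since R 3.3.0) in place of grepl(), so no ^ is needed and special characters in the prefixes are compared literally. The name prefixes is introduced here for illustration only.

```r
MyField <- c("OGame", "Game123", "Duck", "Dugame", "Aldubame")
df <- data.frame(MyField, stringsAsFactors = FALSE)

prefixes <- c("Ga", "Du")  # plain prefixes, no ^ needed

# One logical vector per prefix, combined element-wise with OR
hits <- Reduce(`|`, lapply(prefixes, function(p) startsWith(df$MyField, p)))

# Keep only the rows where at least one prefix matched
result <- df[hits, , drop = FALSE]
print(result)
```

Each startsWith() call is vectorised over the whole column, so the loop runs once per prefix rather than once per row.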
catatau

Using dplyr, with stringr's built-in words and sentences vectors as example data, and grepl() with a leading \\b so that each string matches only at the start of a word.

library(stringr)
library(dplyr)

set.seed(22)

tibble(sentences) %>% 
  rowwise() %>% 
  filter(any(sapply(words[sample(length(words), 10)], function(x) 
    grepl(paste0("\\b", x), sentences)))) %>% 
  ungroup()
# A tibble: 32 × 1
   sentences                                    
   <chr>                                        
 1 It's easy to tell the depth of a well.       
 2 Kick the ball straight and follow through.   
 3 A king ruled the state in the early days.    
 4 March the soldiers past the next hill.       
 5 The dune rose from the edge of the water.    
 6 The grass curled around the fence post.      
 7 Cats and Dogs each hate the other.           
 8 The harder he tried the less he got done.    
 9 He knew the skill of the great young actress.
10 The club rented the rink for the fifth night.
# … with 22 more rows
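
Note that a leading \\b matches at the start of any word in the text, which is slightly broader than matching at the start of the whole value. The small comparison below, using made-up values, may help decide which of the two the data needs.

```r
x <- c("Game123", "OGame", "Big Game")

# \\b before "Ga": matches wherever a word starts with "Ga"
word_start <- grepl("\\bGa", x)

# ^ before "Ga": matches only when the whole value starts with "Ga"
string_start <- grepl("^Ga", x)

print(data.frame(x, word_start, string_start))
```

Here "Big Game" is matched by the \\b version but not by the ^ version, while "OGame" is rejected by both.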
Andre Wildberg