My question relates to matching co-occurring patterns in texts. For example, in this vector:

x <- c("oh well i didn't know that",                        # not match
       "yeah right that's true",                            # not match
       "okay, so she lives in, well, i guess soho",         # not match
       "well, uh oh, okay in that case",                    # match
       "okay, right, well oh why not")                      # match

I would like to match those strings in which the words well, oh, and okay co-occur in any order. I've come up with the regex below, but it also incorrectly matches "okay, so she lives in, well, i guess soho". This is confusing, given that I've wrapped oh in word boundary anchors (\\b):

grep("(?=\\boh\\b)*(?=\\bwell\\b)*(?=\\bokay\\b).*", x, perl = TRUE, value = TRUE)
# "okay, so she lives in, well, i guess soho"  "well, uh oh, okay in that case"  "okay, right, well oh why not"

How can this regex be tweaked to match only those strings in which well, oh, and okay all co-occur as whole words?
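For reference, here is a minimal sketch of one possible direction, not a definitive answer. It assumes the likely cause is that the `*` after each lookahead makes that lookahead optional, so only the final `(?=\\bokay\\b)` is actually enforced and any string containing the word okay passes. The sketch anchors at the start of the string and puts `.*` inside each lookahead so that all three words are required anywhere in the string. The names `pattern`, `matched`, `words`, and `hits` are illustrative only.

```r
x <- c("oh well i didn't know that",
       "yeah right that's true",
       "okay, so she lives in, well, i guess soho",
       "well, uh oh, okay in that case",
       "okay, right, well oh why not")

# Three mandatory lookaheads, each scanning from the start of the string
# for one whole word; no quantifier is attached to the lookaheads themselves.
pattern <- "^(?=.*\\boh\\b)(?=.*\\bwell\\b)(?=.*\\bokay\\b)"
matched <- grep(pattern, x, perl = TRUE, value = TRUE)

# Alternative without lookaheads: test each word separately and
# combine the logical vectors with a logical AND.
words <- c("oh", "well", "okay")
hits <- Reduce(`&`, lapply(words, function(w) {
  grepl(paste0("\\b", w, "\\b"), x, perl = TRUE)
}))

print(matched)
print(x[hits])
```

With the sample vector, both approaches should keep only the last two strings, since the "oh" inside "soho" is not delimited by word boundaries. The second variant may be easier to extend when the list of required words grows.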

Chris Ruehlemann