My question relates to matching co-occurring patterns in texts. For example, in this vector:
x <- c("oh well i didn't know that", # not match
"yeah right that's true", # not match
"okay, so she lives in, well, i guess soho", # not match
"well, uh oh, okay in that case", # match
"okay, right, well oh why not") # match
I would like to match those strings in which the words well, oh, and okay co-occur in any order. I've come up with this regex, which, however, incorrectly also matches "okay, so she lives in, well, i guess soho" (confusingly, despite the fact that I've used word boundary anchors \\b for oh):
grep("(?=\\boh\\b)*(?=\\bwell\\b)*(?=\\bokay\\b).*", x, perl = T, value = T)
# "okay, so she lives in, well, i guess soho" "well, uh oh, okay in that case" "okay, right, well oh why not"
How can this regex be tweaked to match all and only those strings in which well, oh, and okay co-occur as words?
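One possible direction, offered as a sketch rather than a definitive answer: the `*` after each lookahead appears to make that lookahead optional, so only the final `(?=\\bokay\\b)` is effectively enforced, which would explain why every string containing okay is returned. A common way to require several words in any order is to anchor at the start of the string and give each lookahead its own `.*`. The names `pattern`, `matches`, `words`, `hits`, and `matches_alt` below are illustrative only.

```r
x <- c("oh well i didn't know that",
       "yeah right that's true",
       "okay, so she lives in, well, i guess soho",
       "well, uh oh, okay in that case",
       "okay, right, well oh why not")

# One lookahead per word, all anchored at the start of the string;
# the `.*` inside each lookahead lets it scan the whole string,
# and no quantifier is attached to the lookahead itself.
pattern <- "^(?=.*\\boh\\b)(?=.*\\bwell\\b)(?=.*\\bokay\\b)"
matches <- grep(pattern, x, perl = TRUE, value = TRUE)

# Lookaround-free alternative (an assumption about what may read more
# clearly): test each word separately and combine the results with AND.
words <- c("oh", "well", "okay")
hits <- Reduce(`&`, lapply(words, function(w) {
  grepl(paste0("\\b", w, "\\b"), x, perl = TRUE)
}))
matches_alt <- x[hits]
```

With the sample vector, both approaches should keep only the last two strings, since "soho" contains no standalone oh. The second variant scales more easily if the word list changes, at the cost of one `grepl` pass per word.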