R AND Operator in Regex

Question

I am trying to write an expression that takes a few long paragraphs and finds the lines that contain two specific words in the same line, so I am looking for an AND operator. Is there a way to do this?

For example:

c <- ("She sold seashells by the seashore, and she had a great time while doing so.")

I want an expression that finds a line with both "sold" and "great" in the line.

I've tried something like:

grep("sold", "great", c, value = TRUE)

Any ideas?

Thanks so much!

Kevin Arseneau · Answer 1 · 2017-09-17T02:58:55.467

score 4

You can use two capture groups separated by any characters, assuming the order of the words is unimportant:

grep("(sold|great)(?:.+)(sold|great)", c, value = TRUE)
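Note that the pattern above would also accept a line that contains the same word twice (for example "sold ... sold"). One possible way to require both words is a pair of lookaheads. This is only a sketch: the sample `lines` vector is made up for illustration, and lookaheads need `perl = TRUE`.

```r
# Hypothetical sample lines (not from the original question)
lines <- c(
  "She sold seashells and had a great time.",
  "She sold seashells, then sold some more.",
  "She had a great time."
)

# Each (?=...) lookahead asserts that its word occurs somewhere in the line
# without consuming any characters, so both must succeed: a regex AND.
# Lookaheads are a PCRE feature, hence perl = TRUE.
both <- grep("^(?=.*sold)(?=.*great)", lines, value = TRUE, perl = TRUE)
both
```

Only the first line should come back, since it is the only one containing both words.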

edited Sep 17 '17 at 02:58

answered Sep 17 '17 at 02:30

Kevin Arseneau


Thanks, but I'm actually looking for a line that contains both, not either word. If the line has sold but not great, I don't want the line to be returned. – intern14 Sep 17 '17 at 02:46
@intern14, apologies, I misunderstood. See my edit above. – Kevin Arseneau Sep 17 '17 at 03:01

score 3 · Answer 2 · answered Sep 17 '17 at 07:58

While in most cases I would go with the stringr package, as already suggested in CPak's answer, there is also a grep solution to this:

# create the sample string
c <- ("She sold seashells by the seashore, and she had a great time while doing so.")

# match any sold and great string within the text
# ignore case so that Sold and Great are also matched
grep("(sold.*great|great.*sold)", c, value = TRUE, ignore.case = TRUE)

Hmm, not bad, right? But what if the text contains a word that merely includes sold or great as a substring?

# set up alternative string
d <- ("She saw soldier eating seashells by the seashore, and she had a great time while doing so.")
# even soldier is matched here:
grep("(sold.*great|great.*sold)", d, value = TRUE, ignore.case = TRUE)

So you might want to use word boundaries, i.e. match the entire word:

# \\b is a zero-width anchor that matches at word boundaries
grep("(\\bsold\\b.*\\bgreat\\b|\\bgreat\\b.*\\bsold\\b)", d, value = TRUE, ignore.case = TRUE)

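The \\b matches at three kinds of positions: before the first character in the string (if it is a word character), after the last character in the string (if it is a word character), and between two characters where one is a word character and the other is not.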

More on the \b metacharacter here: http://www.regular-expressions.info/wordboundaries.html
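As a possible base R alternative to writing out both word orders, the per-word results of grepl can be combined with `&`. This is a minimal sketch: `has_all_words` is a hypothetical helper name, and it assumes the words contain no regex metacharacters.

```r
# Hypothetical helper: TRUE for each element of x that contains every word
# in `words` as a whole word (case-insensitive).
has_all_words <- function(x, words) {
  hits <- lapply(words, function(w) {
    grepl(paste0("\\b", w, "\\b"), x, ignore.case = TRUE)
  })
  Reduce(`&`, hits)  # element-wise AND across the per-word results
}

x <- c(
  "She sold seashells by the seashore, and she had a great time while doing so.",
  "She saw soldier eating seashells by the seashore, and she had a great time while doing so."
)

res <- has_all_words(x, c("sold", "great"))
x[res]
```

Here the "soldier" sentence is dropped because sold does not appear as a whole word, and adding a third word only means extending the `words` vector.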

score 1 · Answer 3 · answered Sep 17 '17 at 03:09

The duplicate post might get you started, but I don't think it addresses your question directly.

You could combine stringr::str_detect with all

pos <- ("She sold seashells by the seashore, and she had a great time while doing so.") # contains sold and great
neg <- ("She bought seashells by the seashore, and she had a great time while doing so.") # contains great

pattern <- c("sold", "great")

library(stringr)
all(str_detect(pos,pattern))
# [1] TRUE

all(str_detect(neg,pattern))
# [1] FALSE

stringr::str_detect has the advantage (over grepl) of being vectorised over a character vector of patterns.
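The all(str_detect(...)) call above tests a single string. For several lines at once, one option is to apply the same check per line. The sketch below assumes stringr is installed, and the `lines` vector is invented for illustration.

```r
library(stringr)

# Hypothetical multi-line input
lines <- c(
  "She sold seashells by the seashore, and she had a great time.",
  "She bought seashells by the seashore, and she had a great time.",
  "She sold seashells by the seashore."
)
pattern <- c("sold", "great")

# For each line, test every pattern and require all of them to match
keep <- vapply(
  lines,
  function(ln) all(str_detect(ln, pattern)),
  logical(1),
  USE.NAMES = FALSE
)
lines[keep]
```

Applying the check line by line avoids recycling the pattern vector against the vector of lines, which would pair each line with only one of the patterns.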
