
Is it possible to use a pattern like this (see below) in grepl()?

(poverty OR poor) AND (eradicat OR end OR reduc OR alleviat) AND extreme

The goal is to determine if a sentence meets the pattern using ifelse(grepl(pattern, x, ignore.case = TRUE),"Yes","No")

For example, if x = "end extreme poverty in the country", it will return "Yes", while if x = "end poverty in the country", it will return "No".

An earlier post here works only for single words, such as poor AND eradicat AND extreme, but it does not work for my case. Is there any way to achieve my goal?

I tried pattern = "(?=.*poverty|poor)(?=.*eradicat|end|reduce|alleviate)(?=.*extreme)", but it does not work. The error is 'Invalid regexp'.

Yingjie

1 Answer


To apply all three assertions, group the alternatives inside each lookahead with a non-capturing group.

^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+
  • ^ Start of string
  • (?=.*(?:poverty|poor)) Assert either poverty OR poor
  • (?=.*extreme) Assert extreme
  • (?=.*(?:eradicat|end|reduc|alleviat)) Assert either eradicat OR end OR reduc OR alleviat
  • .+ Match one or more characters, here the whole line
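Why the grouping matters can be seen with a small check. The sketch below compares the pattern from the question with the grouped one on a hypothetical input, "end extreme poor" (not taken from the question). Both calls use perl = TRUE so that the lookaheads compile at all.

```r
# Pattern from the question: the alternation inside each lookahead is not grouped.
ungrouped <- "(?=.*poverty|poor)(?=.*eradicat|end|reduce|alleviate)(?=.*extreme)"
# Grouped pattern from this answer.
grouped <- "^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+"

# Hypothetical test sentence containing "poor" rather than "poverty".
x <- "end extreme poor"

# Ungrouped: "(?=.*poverty|poor)" reads as "(.*poverty) OR (poor at this exact
# position)", so the ".*" prefix only applies to the first alternative.
res_ungrouped <- grepl(ungrouped, x, perl = TRUE)

# Grouped: ".*" applies to every alternative inside (?:...).
res_grouped <- grepl(grouped, x, perl = TRUE)

print(c(ungrouped = res_ungrouped, grouped = res_grouped))
```

Here the ungrouped pattern fails to match even though all three concepts are present, while the grouped one matches.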


With grepl, you have to set perl = TRUE to enable PCRE, which supports the lookarounds.

grepl('^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+', v, perl = TRUE)
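As a minimal end-to-end sketch, the call can be combined with the ifelse() wrapper from the question. The first two sentences are the ones given in the question; the third, mixed-case one is a hypothetical addition to show the effect of ignore.case = TRUE.

```r
pattern <- "^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+"

x <- c(
  "end extreme poverty in the country",  # from the question: should be "Yes"
  "end poverty in the country",          # from the question: should be "No"
  "Eradicating EXTREME poverty"          # hypothetical: mixed case
)

# perl = TRUE switches grepl() to PCRE, which is needed for the lookaheads.
result <- ifelse(grepl(pattern, x, ignore.case = TRUE, perl = TRUE), "Yes", "No")
print(result)
# "Yes" "No"  "Yes"
```

Writing perl = TRUE rather than perl = T is a small defensive choice, since T is an ordinary variable in R and can be reassigned.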
The fourth bird
  • This works perfectly! Thank you so much! Would you mind explaining a bit more about `^` and `.+`? Must they be included, or are they optional? It seems `^` here does not mean that a sentence should start with "poverty" or "poor", although it does mean that in other situations. As for `.+`, I am not sure exactly what it does. – Yingjie Jun 22 '21 at 21:09
  • @Yingjie The `^` is an anchor that asserts the start of the string. That is the only position where you want to run all the assertions, and the assertions by themselves are non-consuming: they only assert. The `.+` is a very broad pattern that actually matches the text: one or more of any character except a newline. You can make the match more specific if you want to allow only certain characters. If you don't want to match partial words like `extremes`, you can also add word boundaries using `\b`, for example `^(?=.*\b(?:poverty|poor)\b)(?=.*\bextreme\b)(?=.*\b(?:eradicat|end|reduc|alleviat)\b).+` – The fourth bird Jun 22 '21 at 21:11
  • Thanks for your previous answer! I have a follow-up question -- what should I do if I want to add another `OR` assertion? Building on the previous example, I also want to see if a sentence can match the string "sustainable". I change the code to : `pat <- "^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+|sustainable"` and use this example `text <- "end extreme poor and achieve sustainable"`, but `stringr::str_count(string = text, regex(pattern = pat, ignore_case = T))` only returned 1 match (should be 2 matches). I look forward to your advice. Thanks! – Yingjie Dec 08 '21 at 17:57
  • would you mind taking a second look at my question? Many thanks! – Yingjie Dec 09 '21 at 17:54
  • @Yingjie Like this? https://regex101.com/r/lQPncg/1 – The fourth bird Dec 09 '21 at 18:33
  • Thanks, but actually I would like `'^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+'` and `sustainable` to be connected with `OR`, not `AND`. I changed yours to https://regex101.com/r/mtdTlk/1 - but it does not seem right – Yingjie Dec 09 '21 at 19:05
  • @Yingjie If you want to either have a match when all the assertions are true, OR a match for sustainable, you can use an alternation with the pipe inside a grouping construct `(?:..|..)` See https://regex101.com/r/ssVugB/1 – The fourth bird Dec 09 '21 at 19:20
  • Thanks again, this is great! It seems your 4th example could match twice (i.e., 1st match - "end extreme poor", 2nd match - "sustainable"), but both this webpage and `stringr::str_count()` in R report only one match. Is there a way to fix this? – Yingjie Dec 09 '21 at 21:55
  • @Yingjie I don't think you can do that, as the pattern is anchored with `^` at the start of the string to perform the lookahead assertions over the whole string. If you want all matches, you could split the matched string on a space, and then filter for all the allowed words. – The fourth bird Dec 09 '21 at 21:58
  • Good to know! Thanks!!! – Yingjie Dec 09 '21 at 22:32
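The follow-ups in the comments can be sketched in base R. This is one possible reading, not the contents of the linked regex101 demos: the `either` pattern, the variable names, and every test sentence except the first are assumptions made for illustration. It shows the word-boundary variant, an "all assertions OR sustainable" test, and two ways to get a count instead of a single anchored match.

```r
all_terms <- "^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+"

# Word-boundary variant from the comments (backslashes doubled for R strings).
with_wb <- "^(?=.*\\b(?:poverty|poor)\\b)(?=.*\\bextreme\\b)(?=.*\\b(?:eradicat|end|reduc|alleviat)\\b).+"
wb_plural <- grepl(with_wb, "end extremes of poverty", perl = TRUE)    # "extremes" is rejected
wb_stem   <- grepl(with_wb, "eradicate extreme poverty", perl = TRUE)  # stem + \b rejects "eradicate"

# Assumed form of "all assertions OR sustainable": alternation inside (?:...).
either <- "^(?:(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat))|(?=.*sustainable)).+"

texts <- c(
  "end extreme poor and achieve sustainable",  # from the comments
  "achieve sustainable growth",                # hypothetical
  "end poverty in the country"                 # from the question
)
any_condition <- grepl(either, texts, ignore.case = TRUE, perl = TRUE)

# Counting: test each condition separately and add up the logical results.
conditions_met <- as.integer(grepl(all_terms, texts, ignore.case = TRUE, perl = TRUE)) +
  as.integer(grepl("sustainable", texts, ignore.case = TRUE))

# Alternative suggested in the last reply: split on spaces, keep allowed words.
words <- strsplit(tolower(texts[1]), " ", fixed = TRUE)[[1]]
hits  <- grep("poverty|poor|extreme|eradicat|end|reduc|alleviat|sustainable", words, value = TRUE)

print(any_condition)
print(conditions_met)
print(hits)
```

Note that a trailing `\b` after a stem such as `eradicat` stops it from matching "eradicate", so the boundaries may be better kept to the whole words only. Counting conditions with separate grepl() calls sidesteps the anchoring issue, because each condition is tested independently over the whole string.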