
I am trying to detect all occurrences of a certain string that is not surrounded by certain other strings (using regex lookarounds), e.g. all occurrences of "African" except those inside "South African Society". See the simplified example below.

#My example text:
text <- c("South African Society", "South African", 
"African Society", "South African Society and African Society")

#My code examples:
library(stringr)
str_detect(text, "(?<!South )African(?! Society)")
#or
grepl("(?<!South )African(?! Society)", text, perl = TRUE)

#I need:
[1] FALSE TRUE TRUE TRUE 

#instead of:
[1] FALSE FALSE FALSE FALSE

The problem seems to be that the regex engine evaluates the lookbehind and the lookahead separately rather than as a whole. A match should be rejected only when both conditions hold, not when just one of them does.

MsGISRocker

1 Answer


The (?<!South )African(?! Society) pattern matches African only when it is neither preceded by "South " nor followed by " Society". If either "South " comes before it or " Society" comes after it, there is no match.
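To see how each lookaround vetoes a match on its own, the pattern can be split into its two halves and run against the sample vector from the question. This is only an illustrative sketch in base R; the variable names (`behind_only`, `ahead_only`, `both`) are made up for the demonstration.

```r
text <- c("South African Society", "South African",
          "African Society", "South African Society and African Society")

# Lookbehind alone: rejects every "African" that follows "South ".
behind_only <- grepl("(?<!South )African", text, perl = TRUE)
print(behind_only)  # FALSE FALSE  TRUE  TRUE

# Lookahead alone: rejects every "African" that precedes " Society".
ahead_only <- grepl("African(?! Society)", text, perl = TRUE)
print(ahead_only)   # FALSE  TRUE FALSE FALSE

# Both together: an occurrence survives only if it passes both checks,
# and no occurrence in the sample does.
both <- grepl("(?<!South )African(?! Society)", text, perl = TRUE)
print(both)         # FALSE FALSE FALSE FALSE
```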

There are several solutions.

 African(?<!South African(?= Society))

Here, African is matched first, and the negative lookbehind then rejects the match only if the text ending at the current position is South African and it is immediately followed by a space and Society. Placing this check after African is more efficient than placing it before the word, as in (?<!South (?=African Society))African, when there are long strings that do not match the pattern, because the lookbehind is then evaluated only at positions where African has already been found.
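A minimal sketch of this lookbehind approach in base R, using the sample vector from the question. It assumes the PCRE engine (`perl = TRUE`); the variable names are illustrative only. Both placements of the check are shown, and they should agree on this input.

```r
text <- c("South African Society", "South African",
          "African Society", "South African Society and African Society")

# Check placed after the word: "African" is rejected only when it ends
# "South African" AND " Society" follows immediately.
after_word <- grepl("African(?<!South African(?= Society))", text, perl = TRUE)
print(after_word)   # FALSE  TRUE  TRUE  TRUE

# Equivalent check placed before the word.
before_word <- grepl("(?<!South (?=African Society))African", text, perl = TRUE)
print(before_word)  # FALSE  TRUE  TRUE  TRUE
```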

Alternatively, you may use a SKIP-FAIL technique:

South African Society(*SKIP)(*F)|African

Here, South African Society is matched first, and (*SKIP)(*F) then discards that match and makes the engine resume searching right after the skipped text, so African is matched in every context other than South African Society.
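The SKIP-FAIL pattern can be tried on the question's sample vector as follows. The backtracking verbs are a PCRE feature, so this sketch uses base R with `perl = TRUE`; stringr's default ICU engine is not expected to accept them. The variable names are illustrative.

```r
text <- c("South African Society", "South African",
          "African Society", "South African Society and African Society")

pattern <- "South African Society(*SKIP)(*F)|African"

# TRUE where at least one "African" occurs outside "South African Society".
detected <- grepl(pattern, text, perl = TRUE)
print(detected)   # FALSE  TRUE  TRUE  TRUE

# Count the surviving occurrences per element.
n_matches <- lengths(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
print(n_matches)  # 0 1 1 1
```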

Wiktor Stribiżew
  • Hey Wiktor, you're the one! Thanks a lot. Just a tiny add-on to my simplified question: in practice, I need to exclude a lot of organizations' names from my matches (e.g. "African Journal", "Royal African Society", etc.). What would you consider the most efficient way to code that? – MsGISRocker Nov 29 '18 at 10:30
  • @MrGISRocker Glad to help. Please consider accepting the answer. – Wiktor Stribiżew Nov 29 '18 at 10:31
  • @MrGISRocker The `(*SKIP)(*F)` technique will be the simplest if you want to avoid manual pattern building. – Wiktor Stribiżew Nov 29 '18 at 10:47
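For the follow-up case with many excluded names, one possible way to assemble the SKIP-FAIL pattern from a character vector is sketched below. The helper name `build_skip_pattern`, the exclusion list, and the test strings are all hypothetical and only serve as an illustration. Each name is wrapped in `\Q...\E` so that any regex metacharacters in it are treated literally; this assumes no name contains the sequence `\E`.

```r
# Hypothetical helper: build "(?:name1|name2|...)(*SKIP)(*F)|target".
build_skip_pattern <- function(excluded, target) {
  quoted <- paste0("\\Q", excluded, "\\E", collapse = "|")
  paste0("(?:", quoted, ")(*SKIP)(*F)|", target)
}

# Illustrative exclusion list and inputs (not from the original data).
excluded <- c("South African Society", "African Journal", "Royal African Society")
pattern  <- build_skip_pattern(excluded, "African")

samples <- c("Royal African Society", "African Journal of Ecology",
             "African elephants", "South African wine")

result <- grepl(pattern, samples, perl = TRUE)
print(result)  # FALSE FALSE  TRUE  TRUE
```

Adding another organization then only means appending it to `excluded`; the pattern itself does not have to be edited by hand.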