Negative lookahead in R not behaving as expected

Question

I am trying to replace every word-initial abc in a text I'm working with in R. The output text is highlighted in HTML over a couple of passes, so I need the replacement to ignore text inside HTML tags (between angle brackets).

The following seems to work in Python but I'm not getting any hits on my regex in R. All help appreciated.

test <- 'abcdef abc<span abc>defabc abcdef</span> abc defabc'
gsub('\\babc\\(?![^<]*>\\)', 'xxx', test)

Expected output:

xxxdef xxx<span abc>defabc xxxdef</span> xxx defabc

Instead it is ignoring all instances of abc.

keep in mind http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — MichaelChirico, Apr 17 '17 at 19:57

score 7 · Accepted Answer · answered Apr 17 '17 at 19:54

You need to remove unnecessary escapes and use perl=TRUE:

test <- 'abcdef abc<span abc>defabc abcdef</span> abc defabc'
gsub('\\babc(?![^<]*>)', 'xxx', test, perl=TRUE)
## => [1] "xxxdef xxx<span abc>defabc xxxdef</span> xxx defabc"

When you escape (, it matches a literal ( symbol. So, in your pattern, \(?![^<]*>\) matches a ( 1 or 0 times, then !, then 0+ chars other than <, then > and a literal ). In my regex, (?![^<]*>) is a negative lookahead that fails the match if abc is followed by any 0+ chars other than < and then a >.
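
The difference between the two patterns can be checked directly. The following is a minimal sketch, not part of the original answer; the input 'abc(!x>)' and the variable names are made up for illustration, and perl = TRUE is passed everywhere so that only the escaping differs between the calls.

```r
test <- 'abcdef abc<span abc>defabc abcdef</span> abc defabc'

escaped   <- '\\babc\\(?![^<]*>\\)'  # pattern from the question
lookahead <- '\\babc(?![^<]*>)'      # pattern from the answer

# The escaped pattern requires a literal "!" after abc. The test string
# contains none, so nothing matches and the input comes back unchanged.
unchanged <- gsub(escaped, 'xxx', test, perl = TRUE)
identical(unchanged, test)

# The escaped pattern does match text that really contains an optional "(",
# then "!", then ">" and ")". 'abc(!x>)' is a hypothetical input.
literal_hit <- gsub(escaped, 'xxx', 'abc(!x>)', perl = TRUE)
literal_hit

# With the escapes removed, (?!...) is a zero-width lookahead: only abc
# itself is consumed and replaced.
fixed <- gsub(lookahead, 'xxx', test, perl = TRUE)
fixed
```

Because the escaped version consumes the characters it matches, the whole of 'abc(!x>)' is replaced, whereas the lookahead version only ever replaces abc.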

Without perl=TRUE, R gsub uses the TRE regex flavor that does not support lookarounds (even lookaheads). Thus, you have to tell gsub via perl=TRUE that you want the PCRE engine to be used.
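
To see the engine difference without stopping a script, the call can be wrapped in tryCatch. This is a rough sketch: try_gsub is a hypothetical helper, not a base R function, and the exact wording of any message from the default engine may vary between R versions.

```r
test    <- 'abcdef abc<span abc>defabc abcdef</span> abc defabc'
pattern <- '\\babc(?![^<]*>)'

# try_gsub is a hypothetical helper: it returns the gsub() result, or the
# error message as a string if the chosen engine rejects the pattern.
try_gsub <- function(...) {
  tryCatch(
    suppressWarnings(gsub(...)),
    error = function(e) paste('error:', conditionMessage(e))
  )
}

tre_result  <- try_gsub(pattern, 'xxx', test)               # default engine (TRE)
pcre_result <- try_gsub(pattern, 'xxx', test, perl = TRUE)  # PCRE

tre_result
pcre_result
```

Only the perl = TRUE call is expected to return the replaced string shown in the answer; the default call should report that the pattern is not a valid regular expression for that engine.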

Wiktor Stribiżew

> You need to remove unnecessary escapes !@#$@#!%% Thanks Wiktor! – Rich Ard Apr 17 '17 at 20:08

`perl=TRUE` is something I always forget. Thanks. – Lazarus Thurston Oct 06 '20 at 17:23

@LazarusThurston Sometimes you do not need it, but you would very seldom choose not to use `perl=TRUE`; in most cases, `perl=TRUE` is a good idea. See the [TRE vs. PCRE differences](https://stackoverflow.com/a/47251004/3832970). – Wiktor Stribiżew Oct 06 '20 at 17:30
