I'm trying to use this regexp in R:

\?(?=([^'\\]*(\\.|'([^'\\]*\\.)*[^'\\]*'))*[^']*$)

I'm escaping like so:

\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)

I get an invalid regexp error.

RegexPal has no problem with the regex, and I've checked that the interpreted regex in the R error message is exactly the same as what I'm using in RegexPal, so I'm at a loss. I don't think the escaping is the problem.

Code:

output <- sub("\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)", "!", "This is a test string?")
John Chrysostom
    Just set `T <- 0` (as might be written by someone setting up a survival analysis simulation), and see what happens. (Then try `TRUE <- 0`) – IRTFM Jul 21 '15 at 18:40

1 Answer

By default, R uses the POSIX (Portable Operating System Interface) standard of extended regular expressions (see these SO posts [1,2] and `?regex` [caveat emptor: machete-level density ahead]).

Look-ahead (`(?=...)`), look-behind (`(?<=...)`) and their negations (`(?!...)` and `(?<!...)`) are probably the most salient examples of PCRE-specific (Perl-Compatible Regular Expressions) constructs. POSIX does not support them, which is why the default engine rejects your pattern as an invalid regexp.

R can be made to understand your regex by setting the `perl` argument to `TRUE`; this argument is available in the base regex functions (`sub`, `gsub`, `grepl`, `regexpr`, `gregexpr`, etc.):

output <- sub(
  "\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)",
  "!",
  "This is a test string?",
  perl = TRUE
)
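
To see the difference between the two engines in isolation, here is a minimal sketch with a much smaller look-ahead pattern. The pattern `a(?=b)` and the variable names are illustrative only and are not from the question; the default call is wrapped in `tryCatch()` so the script keeps running when the pattern is rejected.

```r
# Illustrative pattern (not from the question): "a" only when followed by "b".
pattern <- "a(?=b)"

# Default engine: the look-ahead syntax is not understood, so the call
# raises an error, which we catch and turn into a marker value.
default_result <- tryCatch(
  suppressWarnings(grepl(pattern, "ab")),
  error = function(e) "invalid regexp"
)

# PCRE engine: the same pattern compiles and matches.
perl_result <- grepl(pattern, "ab", perl = TRUE)

print(default_result)
print(perl_result)
```

The only change between the two calls is `perl = TRUE`, which suggests the error in the question comes from the choice of engine rather than from the escaping.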

The pattern looks much less intimidating in R >= 4.0, which supports raw strings, so the backslashes no longer need to be doubled:

output <- sub(
  R"(\?(?=([^'\\]*(\\.|'([^'\\]*\\.)*[^'\\]*'))*[^']*$))",
  "!",
  "This is a test string?",
  perl = TRUE
)
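
As a quick sanity check, the sketch below (which assumes R >= 4.0 for the raw string syntax) compares the escaped and raw spellings of the pattern and tries it on two inputs. The second input string and the variable names are made up for illustration; the intent is to show that a `?` inside single quotes is left alone.

```r
# The two spellings of the pattern from the question.
escaped <- "\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)"
raw     <- R"(\?(?=([^'\\]*(\\.|'([^'\\]*\\.)*[^'\\]*'))*[^']*$))"

# Both literals should produce the same character string.
same_pattern <- identical(escaped, raw)

# Input from the question: the trailing "?" is outside any quotes.
plain <- sub(raw, "!", "This is a test string?", perl = TRUE)

# Hypothetical input: the first "?" sits inside single quotes,
# so only the unquoted "?" at the end should be replaced.
quoted <- sub(raw, "!", "Is 'a?' quoted?", perl = TRUE)

print(same_pattern)
print(plain)
print(quoted)
```

Note that `sub()` replaces only the first match; use `gsub()` if every unquoted `?` should be replaced.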
MichaelChirico