
I was reading about The Greatest Regex Trick Ever, which lets you say "match this, unless..." using (*SKIP)(*FAIL). I took it for a spin on the toy example below: it works in base R but throws the following error in stringi. Do I need to do something different in stringi to get the syntax to work?

x <- c("I shouldn't", "you should", "I know", "'bout time")
pat <- '(?:houl)(*SKIP)(*FAIL)|(ou)'

grepl(pat, x, perl = TRUE)
## [1] FALSE  TRUE FALSE  TRUE

stringi::stri_detect_regex(x, pat)
## Error in stringi::stri_detect_regex(x, pat) : 
##   Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)
Tyler Rinker
  • `stringi` uses the ICU regex flavor, which doesn't support control verbs like `(*SKIP)` and `(*FAIL)`. They based it on the Java flavor, so the Java version of the Trick (such as it is) should work. – Alan Moore Jan 14 '16 at 23:04
  • You can use this trick too: `ou(?:(?!l)|(?<!hou))`. The advantage is that the pattern starts with a literal string (which speeds up the search), and the lookarounds are tested only afterwards. – Casimir et Hippolyte Jan 14 '16 at 23:25
  • @AlanMoore Informative, thank you. I couldn't seem to locate a Java equivalent. I tried `stringi::stri_detect_regex(x, 'houl|(ou)')` but it yields `[1] TRUE TRUE FALSE TRUE`, whereas it should fail on the first element. – Tyler Rinker Jan 15 '16 at 02:20
  • Try with http://stackoverflow.com/questions/24534782/how-do-skip-or-f-work-on-regex – gagolews Feb 07 '16 at 14:00

1 Answer


The stringi package (and stringr, which is built on top of it) uses the ICU regex library, which does not support the (*SKIP) and (*FAIL) verbs. These backtracking control verbs are a Perl/PCRE feature, which is why the pattern works in base R with perl = TRUE.
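For completeness, the alternation-plus-capture-group form of the trick (the one intended for flavors without control verbs) can be emulated in stringi by matching `houl|(ou)` and keeping only the strings where group 1 actually captured something. This is a minimal sketch rather than the only way to do it; it assumes the stringi package is installed, and the helper name `has_wanted_ou` is made up for illustration.

```r
library(stringi)

x <- c("I shouldn't", "you should", "I know", "'bout time")

# Hypothetical helper: TRUE when at least one match of the pattern filled
# capture group 1, i.e. an "ou" was found outside of "houl".
has_wanted_ou <- function(strings, pattern = "houl|(ou)") {
  matches <- stri_match_all_regex(strings, pattern)
  # Each list element is a matrix: column 1 is the whole match, column 2 is
  # group 1 (NA when the "houl" branch matched or nothing matched at all).
  vapply(matches, function(m) any(!is.na(m[, 2])), logical(1))
}

has_wanted_ou(x)
## [1] FALSE  TRUE FALSE  TRUE
```

Checking every match (rather than only the first) matters for inputs such as "should you", where the first match is the unwanted houl and the wanted ou comes later.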

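Since you want to match ou that is not part of houl, the simplest option is a pair of ordinary lookarounds that require ou to be neither preceded by h nor followed by l. Note that this is slightly stricter than the original pattern, because it also rejects ou in words such as hour or soul: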

(?<!h)ou(?!l)

See the regex demo

> x <- c("I shouldn't", "you should", "I know", "'bout time")
> pat1 <- "(?<!h)ou(?!l)"
> stringi::stri_detect_regex(x, pat1)
[1] FALSE  TRUE FALSE  TRUE
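To see where the approaches diverge, the sketch below runs the original PCRE pattern, the strict lookaround pattern, and the variant suggested in the comments on a few extra words. The vector `y` is my own test input, not part of the question, and the code assumes stringi is installed.

```r
library(stringi)

# Extra test words (not from the question): "ou" inside "houl",
# "ou" after h only, "ou" before l only, and a plain "ou".
y <- c("should", "hour", "soul", "you")

# Original PCRE pattern: skips "houl", then matches any remaining "ou".
pcre_res <- grepl("(?:houl)(*SKIP)(*FAIL)|(ou)", y, perl = TRUE)
pcre_res
## [1] FALSE  TRUE  TRUE  TRUE

# Strict lookarounds: reject "ou" preceded by h OR followed by l.
strict_res <- stri_detect_regex(y, "(?<!h)ou(?!l)")
strict_res
## [1] FALSE FALSE FALSE  TRUE

# Pattern from the comments: reject "ou" only when it sits inside "houl".
exact_res <- stri_detect_regex(y, "ou(?:(?!l)|(?<!hou))")
exact_res
## [1] FALSE  TRUE  TRUE  TRUE
```

If words like hour and soul should count as matches, the pattern from the comments reproduces the PCRE behaviour on these inputs, while `(?<!h)ou(?!l)` does not.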

I can also suggest another approach here. If all you need is a boolean that is TRUE only for strings that contain ou and contain no houl anywhere, you may use

stringi::stri_detect_regex(x, "^(?!.*houl).*ou")

See another regex demo

Details

  • ^ - start of the string
  • (?!.*houl) - a negative lookahead that fails the match if, right after the start of the string, there are 0+ chars other than line break chars (as many as possible) followed by houl
  • .* - 0+ chars other than line break chars, as many as possible
  • ou - an ou substring.
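
Running this pattern on the sample vector makes its whole-string behaviour visible. The sketch below assumes stringi is installed and reuses the question's `x`; the variable name `whole_string_res` is only for illustration.

```r
library(stringi)

x <- c("I shouldn't", "you should", "I know", "'bout time")

# The lookahead is anchored at the start, so a single "houl" anywhere in the
# string rejects the whole string, even if a separate "ou" is also present.
whole_string_res <- stri_detect_regex(x, "^(?!.*houl).*ou")
whole_string_res
## [1] FALSE FALSE FALSE  TRUE
```

The second element, "you should", is now FALSE because the string contains houl, which differs from the per-occurrence result of the lookaround pattern above.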

More details on Lookahead and Lookbehind Zero-Length Assertions.

Note that in ICU a lookbehind cannot contain patterns of unbounded width; limiting quantifiers with an upper bound, however, are supported inside lookbehinds. So, in stringi, if you wanted to match any ou that is not preceded by an s earlier in the same word, you can use

> pat2 <- "(?<!s\\w{0,100})ou"
> stringi::stri_detect_regex(x, pat2)
[1] FALSE  TRUE FALSE  TRUE

Here, the constrained-width lookbehind (?<!s\\w{0,100}) fails the match if ou is preceded by s followed by 0 to 100 word characters (letters, digits, or underscores).
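
As a rough illustration of the bounded-width requirement, the sketch below tries the same idea with an unbounded `\\w*` inside the lookbehind and catches whatever error comes back. I would expect ICU to reject that pattern, but the exact message depends on the ICU version, so none is shown; the variable names are made up for the example.

```r
library(stringi)

x <- c("I shouldn't", "you should", "I know", "'bout time")

# Unbounded lookbehind (\\w* has no upper limit): expected to be rejected
# by ICU. The error message, if any, is captured instead of stopping.
unbounded_res <- tryCatch(
  stri_detect_regex(x, "(?<!s\\w*)ou"),
  error = function(e) conditionMessage(e)
)

# Bounded lookbehind (\\w{0,100} has an explicit upper limit): accepted.
bounded_res <- stri_detect_regex(x, "(?<!s\\w{0,100})ou")
bounded_res
## [1] FALSE  TRUE FALSE  TRUE
```

The upper bound of 100 is arbitrary; any limit that comfortably exceeds the longest word you expect serves the same purpose.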

Tyler Rinker
Wiktor Stribiżew