27

I was recently reading along in the PCRE - (Perl-compatible regular expressions) documentation and came across some interesting tricks with regular expression. As I continued to read and exhaust myself, I stopped because of some confusion in relation with using a few of the (*...) patterns.

My question and confusion relates to (*PRUNE) and (*FAIL)

Now for reference (*SKIP) acts like (*PRUNE), except that if the pattern is unanchored, the bumpalong advance is not to the next character, but to the position in the subject where (*SKIP) was encountered.

The documentation states that (*PRUNE) causes the match to fail at the current starting position in the subject if the rest of the pattern does not match. And it states (*FAIL) synonymous with (?!) negative assertion. Forces a matching failure at the given position in the pattern.

So basically (*FAIL) behaves like a failing negative assertion and is a synonym for (?!)

And (*PRUNE) causes the match to fail at the current starting position in the subject if there is a later matching failure that causes backtracking to reach it.

How are these different when it comes to a point of failing?

Can anyone provide examples of how these are implemented and used correctly?

Community
  • 1
  • 1
hwnd
  • 69,796
  • 4
  • 95
  • 132
  • 3
    I suggest that you should become *completely fluent* with the upper levels of Perl regular expressions before you worry yourself with elements like this. In addition, if you choose to use such arcane functionality then you should have an extremely good reason, and be prepared to shoulder the hatred of anyone who comes to read or maintain your code. Regular expressions fascinate people, and far too often they are forced to perform work that belongs elsewhere. PCRE is deficient in many ways as a language, and its use should be confined to trivial instances driven by higher-level constructs. – Borodin Nov 15 '13 at 03:43
  • This question has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Advanced Regex-Fu". – aliteralmind Apr 10 '14 at 01:40

1 Answers1

40

Before reading this answer, you should be familiar with the mechanism of backtracking, atomic groups, and possessive quantifiers. You can find information about these notions and features in the Friedl book and following these links: www.regular-expressions.info, www.rexegg.com

All the test has been made with a global search (with the preg_match_all() function).

(*FAIL) (or the shorthand (*F))

baabo caaco daado

caac(*FAIL)|aa.|caaco|co

[0] => aab
[1] => caaco
[2] => aad

(*FAIL) causes exactly the same behavior as a "bad character" in a pattern. If you replace it with an "R", for example, you obtain exactly the same result: caacR|aa.|caaco|co. To be more general, you can indeed replace (*FAIL) with an "always failing subpattern" like: (?!), (?=a(?<!a)),...

a (first from "baabo") : Without surprise, the first result is found by the second alternative. (aab)

c (first) : The regex engine encounters the first "c" and tries the first alternative and finds: caac, but the subpattern is forced to fail. Then the regex engine (always from the first "c") tries the second alternative that fails, the third alternative succeeds. (caaco)

a (first from "daado") : the third result is found by the second alternative. (aad)

(*SKIP)

baabo caaco daado

caa(*SKIP)c(*FAIL)|aa.|caaco|co

[0] => aab
[1] => co
[2] => aad

This verb defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later. Consequently, all the characters found before with the subpattern are consumed, once and for all, and can not be used for another part of the pattern (alternative).

a (first from "baabo") : the first result is found by the second alternative. (aab)

c (first) : the regex engine finds caac as in the first case, then fails (cause of the (*FAIL) verb), backtracks to the second "c" but is not allowed to backtrack to characters that are previously matched ("caa") before the (*SKIP) verb.
c (second) : Now, the regex engine tries always the first alternative but in this new position and fails since there is an "o" and not an "a" after, then it backtracks to this second "c". Note that in this case, these characters are not consumed as previously because the subpattern has failed before to have reached the (*SKIP) verb. The second alternative is tested and fails (doesn't begin with a "c"). The third alternative fails too because the next character is an "o" and not an "a". The fourth alternative succeeds and gives the second result. (co)

a (first from "daado") : the third result is found by the second alternative. (aad)

(*PRUNE)

baabo caaco daado

caa(*PRUNE)c(*FAIL)|aa.|caaco|co

[0] => aab
[1] => aac
[2] => aad

This verb is different from (*SKIP) because it doesn't forbid using all the previous matched characters, but skips the first matched character by the subpattern (or forbids a subpattern to start with) if the subpattern will fail later.

a (first from "baabo") : the first result is found by the second alternative. (aab)

c (first) : the regex engine finds caac as in the first case, then fails, but now backtracks to the first "a" from "caaco" since the first "c" is skipped.
a (first from "caaco") : the first alternative is tried and fails, the second succeeds and gives the second result. (aac)

a (first from "daado") : the third result is found by the second alternative. (aad)

Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125
  • 1
    Thank you for the very detailed answer, this sums my confusion. – hwnd Nov 15 '13 at 19:35
  • 1
    @hwnd: The pcre documentation is indeed light on examples about the subject. – Casimir et Hippolyte Nov 15 '13 at 19:47
  • Feel free to post an answer on this one. http://stackoverflow.com/questions/19437211/special-construct-of-expression-use – hwnd Nov 18 '13 at 02:32
  • 2
    Cool it is my first bounty! Thanks @HamZa. – Casimir et Hippolyte Jan 29 '14 at 22:41
  • This answer has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Control Verbs and Recursion". – aliteralmind Apr 10 '14 at 01:08
  • 3
    too sad, this great answer only viewed 370 times! – Eathen Nutt Apr 13 '14 at 00:27
  • @CasimiretHippolyte it's unclear for me. What is `a (first from "baabo")`? – Avinash Raj Jul 12 '14 at 05:02
  • the first letter "a" in the word "baabo" – Casimir et Hippolyte Jul 12 '14 at 06:09
  • @CasimiretHippolyte: It seems that what `(*SKIP)` does is: On backtracking to `(*SKIP)`, it fails the pattern outright and set the starting point of the next match to the index of `(*SKIP)`, or if it's the same index, bump it along by one character. At least that is what I can observe from this test: https://regex101.com/r/gL8uX9/1. Also, if `(*SKIP)` is not on the backtracking path, `(*SKIP)` doesn't have any effect: https://regex101.com/r/lX2bG3/1 `(?=.(*SKIP))(*FAIL)|.` – nhahtdh Mar 17 '16 at 03:05