
I have a simple regex like this:

@123(?:(?:(?P<test>[\s\S]*)456(?P<test1>(?P>test))789))@

It should match the following string fine:

123aaaa456bbbb789

But it doesn't.

But if I replace the subroutine reference with a direct copy of the regex:

@123(?:(?:(?P<test>[\s\S]*)456(?P<test1>[\s\S]*)789))@

Then it works perfectly fine.

I can't figure out why referencing the pattern by the group name isn't working.

Askerman

1 Answer


The point here is that [\s\S]* is a *-quantified subpattern, which a regex engine can normally backtrack into once the subsequent subpatterns fail to match. Recursion and subroutine calls in PCRE are atomic, however: once (?P>test) has grabbed its 0+ chars, the engine has no way to backtrack into it, and that is why the pattern fails to match.

In short, the @123(?:(?:(?P<test>[\s\S]*)456(?P<test1>(?P>test))789))@ pattern behaves as if it were written as

@123(?:(?:(?P<test>[\s\S]*)456(?P<test1>[\s\S]*+)789))@
                                              ^^

and since [\s\S]*+ has already consumed 789 together with the rest of the string, the engine cannot backtrack to let the 789 part of the pattern match.
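The difference can be reproduced outside PCRE. The sketch below is only an illustration in Python: the standard re module has no subroutine calls, so the atomic behaviour is emulated here with the usual lookahead-plus-backreference trick (the lookahead captures the greedy run once, and the backreference then consumes exactly that text). The variable names are made up for the example.

```python
import re

SUBJECT = "123aaaa456bbbb789"

# Plain greedy version: the second [\s\S]* can give back "789" after a failure.
backtracking = re.compile(r"123(?P<test>[\s\S]*)456(?P<test1>[\s\S]*)789")

# Emulated atomic version (an assumption standing in for PCRE's atomic call):
# the lookahead captures the greedy run once, the backreference consumes it,
# and nothing can be given back to the trailing 789.
atomic = re.compile(r"123(?P<test>[\s\S]*)456(?=(?P<test1>[\s\S]*))(?P=test1)789")

plain_match = backtracking.search(SUBJECT)
atomic_match = atomic.search(SUBJECT)

print(plain_match.group("test"), plain_match.group("test1"))  # aaaa bbbb
print(atomic_match)  # None
```

The second pattern fails for the same reason the subroutine call does: the part after 456 swallows bbbb789 and never releases the final 789.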

See PCRE docs:

In PCRE (like Python, but unlike Perl), a recursive subpattern call is always treated as an atomic group. That is, once it has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure.

No idea why they mention Python here, since re does not support recursion (unless they meant the PyPI regex module).

If you are looking for a solution, you might use a (?:(?!789)[\s\S])* tempered greedy token instead of [\s\S]*. It matches a char only if that char does not start a 789 char sequence, so there is no need to backtrack to accommodate 789:

123(?:(?:(?P<test>(?:(?!789)[\s\S])*)456(?P<test1>(?P>test))789))
                  ^^^^^^^^^^^^^^^^^^

See this regex demo.
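As a rough check of the tempered greedy token, here is a small Python sketch. It again assumes the lookahead-plus-backreference emulation of an atomic call, because Python's re cannot express (?P>test) directly; TEMPERED and capture are hypothetical names used only for this illustration.

```python
import re

# Tempered greedy token: consume a char only if it does not start "789".
TEMPERED = r"(?:(?!789)[\s\S])*"

# The second use of the token is wrapped in a lookahead and replayed with a
# backreference, so it cannot be backtracked into (emulated atomic call).
pattern = re.compile(
    rf"123(?P<test>{TEMPERED})456(?=(?P<test1>{TEMPERED}))(?P=test1)789"
)

def capture(subject):
    """Return the (test, test1) captures, or None if there is no match."""
    m = pattern.search(subject)
    return (m.group("test"), m.group("test1")) if m else None

print(capture("123aaaa456bbbb789"))   # ('aaaa', 'bbbb')
print(capture("123aaaa456bb7bb789"))  # ('aaaa', 'bb7bb')
```

Because the token stops right before 789 on its own, the match succeeds even though the second part cannot give characters back, and a stray digit between 456 and 789 is still accepted.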

Wiktor Stribiżew
  • In this case the regex `123(?:(?:(?P<test>\D*)456(?P<test1>(?P>test))789))` solves the problem – splash58 Jan 26 '18 at 11:01
  • @splash58 Yes, because `\D*` does not match digits, but if there is a digit between `456` and `789` and a match *is* expected, that cannot be a generic solution. A more appropriate (though less efficient) option is a [tempered greedy token solution](https://regex101.com/r/pyuAHi/1). – Wiktor Stribiżew Jan 26 '18 at 11:04
  • an interesting approach :) – splash58 Jan 26 '18 at 11:44