How exactly does this recursive regex work?

Question

This is a followup to this question.

Have a look at this pattern:

(o(?1)?o)

It matches any sequence of o with a length of 2ⁿ, with n ≥ 1.
It works, see regex101.com (word boundaries added for better demonstration).
The question is: Why?

In the following, the description of a string (match or not) will simply be a bolded number or a bolded term that describes the length, like 2ⁿ.

Broken down (with added whitespace):

( o (?1)? o )
(           ) # Capture group 1
  o       o   # Matches an o each at the start and the end of the group
              # -> the pattern matches from the outside to the inside.
    (?1)?     # Again the regex of group 1, or nothing.
              # -> Again one 'o' at the start and one at the end. Or nothing.

I don't understand why this matches 2ⁿ and not 2n, because I would describe the pattern as an undefined number of o…o pairs nested inside each other.

Visualization:

No recursion, 2 is a match:

oo

One recursion, 4 is a match:

o  o
 oo

So far, so easy.

Two recursions. Obviously wrong because the pattern does not match 6:

o    o
 o  o
  oo

But why? It seems to fit the pattern.

I conclude that it's not simply the plain pattern that is repeated because otherwise 6 would have to match.

But according to regular-expressions.info:

(?P<name>[abc])(?1)(?P>name) matches three letters like (?P<name>[abc])[abc][abc] does.

and

([abc])(?1){3} [...] is equivalent to ([abc])[abc]{3}

So it does seem to simply rematch the regex code, without any information about the previous match of the capture group.

Can someone explain and maybe visualize why this pattern matches 2ⁿ and nothing else?

Edit:

It was mentioned in the comments:

I doubt that referencing a capture group inside of itself is actually a supported case.

regular-expressions.info does mention the technique:

If you place a call inside the group that it calls, you'll have a recursive capturing group.

You understand recursion correctly. Word boundaries baffle you here. [Look here](https://regex101.com/r/SJ3SF3/1), 6 `o`s are matched just fine. — Wiktor Stribiżew, May 10 '17 at 10:39
That's interesting. You're right, that baffles me. Where's the difference between 6, 8, 12, and 16 in regards to word boundaries? I'll edit the question later. — Imanuel, May 10 '17 at 10:50

Wiktor Stribiżew · Accepted Answer · 2017-05-10T12:02:45.073

You understand recursion correctly. Word boundaries baffle you here. The \b anchors around the pattern require the regex engine to match the string only if it is neither preceded nor followed by word chars.

See how the recursion goes here:

( o      (?1)?         o )  => oo

(?1) is then replaced with (o(?1)?o):

( o   (?>o(?1)?o)?     o )  => oo or oooo

Then again:

(o (?>o(?>o(?1)?o)?o)?  o) => oo, oooo, oooooo

See the regex demo without word boundaries.
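
The expansion steps can be reproduced by plain text substitution. This is only a sketch: it assumes Python's standard `re` module, which has no `(?1)`, so the recursion is unrolled by hand to a fixed depth, and it uses ordinary non-capturing groups instead of atomic ones. The helper name `unroll` is made up for illustration. It therefore shows the plain "nested o…o" reading, in which every even length up to the unrolling depth fits, and not the atomic behavior of the real engine.

```python
import re


def unroll(depth):
    """Textually expand o(?1)?o `depth` times, dropping the innermost call.

    Hypothetical helper: non-capturing groups stand in for the recursion,
    so full backtracking into every level is allowed.
    """
    pattern = "oo"
    for _ in range(depth):
        pattern = "o(?:" + pattern + ")?o"
    return pattern


print(unroll(2))  # o(?:o(?:oo)?o)?o

# Which run lengths does the depth-3 expansion match as a whole?
compiled = re.compile(unroll(3))
lengths = [n for n in range(1, 10) if compiled.fullmatch("o" * n)]
print(lengths)  # [2, 4, 6, 8]
```

With full backtracking, 6 is matched as a whole, which is the behavior the question expected.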

Why add (?>...) in the example above? Each recursion level in PHP recursive regexes is atomic, unlike in Perl: once a recursion level has finished matching, the engine does not backtrack into it to try a different match.

When you add word boundaries, the first o and last o matched cannot have any other word chars before/after. So, ooo won't match then.
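
To see how atomic recursion and the word boundaries combine, here is a small pure-Python model of the group. It is a sketch under stated assumptions, not the engine's implementation: `group_end` is a hypothetical function that treats every recursive call as atomic (the inner call returns its first successful end position and is never re-entered), and the `\b...\b` requirement on a run consisting only of os is modeled as "the group must consume the whole run".

```python
def group_end(text, pos=0):
    """End index of the match of (o(?1)?o) starting at pos, or None.

    Assumption: each recursive call is atomic, so once the inner call has
    returned it is never asked for a shorter alternative.
    """
    if pos >= len(text) or text[pos] != "o":
        return None  # the leading o is mandatory
    inner_end = group_end(text, pos + 1)  # greedy ?: try the recursion first
    if inner_end is not None and text[inner_end:inner_end + 1] == "o":
        return inner_end + 1  # recursion plus the trailing o both fit
    # Otherwise give up the recursion entirely and match the trailing o.
    if text[pos + 1:pos + 2] == "o":
        return pos + 2
    return None


# Run lengths for which the group consumes the whole run of os.
full = [n for n in range(1, 40) if group_end("o" * n) == n]
print(full)  # [2, 4, 8, 16, 32]
```

In this model only the powers of two are consumed completely, which matches what the pattern does with word boundaries around it.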

See Recursive Regular Expressions explained step by step and Word Boundary: \b at rexegg.com, too.

Why does oooooo not get matched as a whole but as oooo and oo?

Again, each recursion level is atomic. oooooo is matched like this:

(o(?1)?o) matches the first o.
(?1)? gets expanded, so the pattern is now (o(?>o(?1)?o)?o), and it matches the second o in the input.
This goes on until (o(?>o(?>o(?>o(?>o(?>o(?>o(?1)?o)?o)?o)?o)?o)?o)?o), which no longer matches the input. Backtracking happens, and we go back to the 6th level.
The whole 6th recursion level also fails, since it cannot match the necessary number of os.
This goes on until a level is reached that can match the necessary number of os.

See the regex debugger (linked in the comments below).

I still struggle to understand, why 6 `o`s is matched as 4 + 2, 7 `o`s is matched as 6? — Sebastian Proske, May 10 '17 at 11:19
@SebastianProske: Check [this debugger](https://regex101.com/r/nMqIaT/1/debugger) - the first `o` (on the left side of the recursion construct) grabs all the `o`s in the input string. Then each final `o` must be accommodated for *on each depth level*. The engine backtracks within the main subpattern this way. — Wiktor Stribiżew, May 10 '17 at 11:29
@SebastianProske: And it also has got to do with the fact that [each recursion depth is atomic](http://www.rexegg.com/regex-recursion.html): since the first `o` before `(?1)` matched all the `o`s in the string, then there is no place for the final `o` to match as there is no more text for the last but one recursion level. — Wiktor Stribiżew, May 10 '17 at 11:38
Thanks, I finally got it all figured out. I have added the steps I took to realize as an answer - but yours definitely deserves my +1. — Sebastian Proske, May 10 '17 at 12:42

score 2 · Answer 2 · answered May 10 '17 at 12:40

This is more or less a follow-up to Wiktor's answer. Even after removing the word boundaries, I had a hard time figuring out why oooooo (6) gets matched as oooo and oo, while ooooooo (7) gets matched as oooooo.

Here is how it works in detail:

When expanding the recursive pattern, the inner recursions are atomic. With our pattern we can unroll it to

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)

(In the actual pattern this gets unrolled once more, but that doesn't change the explanation.)

And here is how the strings are matched - first oooooo (6)

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)
o   |ooooo                          <- first o gets matched by first atomic group
o   o   |oooo                       <- second o accordingly
o   o   o   |ooo                    <- third o accordingly
o   o   o   o   |oo                 <- fourth o accordingly
o   o   o   o   oo|                 <- fifth/sixth o by the innermost atomic group
                     ^              <- there is no more o to match, so backtracking starts - innermost ag is not matched, cursor positioned after 4th character
o   o   o   o   xx   o   |o         <- fifth o matches, fourth ag is successfully matched (thus no backtracking into it)
o   o   o   o   xx   o   o|         <- sixth o matches, third ag is successfully matched (thus no backtracking into it)
                           ^        <- no more o, backtracking again - third ag can't be backtracked in, so backtracking into second ag (with matching 3rd 0 times)
o   o                      |oo<oo   <- third and fourth o close second and first atomic group -> match returned  (4 os)

And now ooooooo (7)

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)    
o   |oooooo                         <- first o gets matched by first atomic group
o   o   |ooooo                      <- second o accordingly
o   o   o   |oooo                   <- third o accordingly
o   o   o   o   |ooo                <- fourth o accordingly
o   o   o   o   oo|o                <- fifth/sixth o by the innermost atomic group
o   o   o   o   oo  o|              <- fourth ag is matched successfully (thus no backtracking into it)
                         ^          <- no more o, so backtracking starts here, no backtracking into fourth ag, try again 3rd
o   o   o                |ooo<o     <- 3rd ag can be closed, as well as second and first -> match returned (6 os)
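
Both walkthroughs follow the same rule, which can be written as a small recurrence. The sketch below is my own summary under the atomic-recursion assumption, not engine code, and `matched_prefix` is a hypothetical helper name. For a run of n os, the recursion after the first o grabs whatever a run of n - 1 would give. If an o is left over for the closing o, the result grows by two; otherwise the recursion is dropped and only oo is matched.

```python
def matched_prefix(n):
    """Length matched by (o(?1)?o) at the start of a run of n os (0 = no match)."""
    lengths = [0, 0]  # runs of 0 or 1 os cannot match at all
    for size in range(2, n + 1):
        inner = lengths[size - 1]  # what the recursion grabs after the first o
        if inner and 1 + inner < size:
            lengths.append(inner + 2)  # an o is left for the closing o
        else:
            lengths.append(2)  # recursion dropped, plain oo
    return lengths[n]


print({n: matched_prefix(n) for n in range(2, 9)})
# {2: 2, 3: 2, 4: 4, 5: 2, 6: 4, 7: 6, 8: 8}
```

This reproduces 4 for six os and 6 for seven os. The whole run is consumed only when the previous length came up exactly two short, which in this model happens at 2, 4, 8, 16 and so on.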
