Why do atomic groups and non-capturing seem to add many steps

Question

I am re-asking the exact same question because the last one was marked as a duplicate of this one, despite them being different situations. The "duplicate" answer is about whether the regex is anchored or not, and the use of the m flag or not, IOW, the regexes were actually functionally different. This one is asking why theoretically no-effect non-capture groups affect the number of steps.

Just looking at results from regex101, I am trying a fairly simple test:

Match /^\s+a/ (notice no g or m flags) against " b", which of course fails to match.

Just running /^\s+a/, regex101 reports 3 steps, so very efficient.

If I do /^(?:\s)+a/ or /^(?>\s)+a/, it suddenly jumps up to 24 steps. If I was to chalk this up to just having regex101 all of a sudden "count" each space whereas before it counted all the spaces as one step, then I'd expect the result to be 9 (start, 7 spaces, incorrect end character). But this is roughly triple that result. Is it still the case that its all just "what counts as a step" stuff, and they are actually equally fast (considering that non-capture groups and atomic groups should not be storing any information for later use), or is there actually a roughly 3x algorithmic slowdown just from using a useless group.

This is of course just a reduction of a problem I'm dealing with, where I am using non-capture groups for a practical reason (having two choices ored together), and seeing the number of steps explode well beyond the 2x I'd expect from checking 2 kinds of things vs. one kind of thing. I narrowed it down to this non-capture group thing.

Quick Edit: For whatever reason, regex101 will report "0 steps" in the as-you-type view, but if you click on the debugger, it will show the actual number of steps:

Also, make sure to turn off the g and m flags.

Here are links to all three:

1./^\s+a/: https://regex101.com/r/97W4dk/1

/^(?:\s)+a/: https://regex101.com/r/3LjfF6/1
/^(?>\s)+a/: https://regex101.com/r/Lg5gLw/1

Ok, you use a group to capture a char as a repeated group. This is a known thing. The PCRE regex engine just can't apply all its inner optimizations one the pattern is enclosed in a group. Groups prevent the inner PCRE pattern optimization. — Wiktor Stribiżew, Dec 02 '20 at 21:50
Yes, I see, you selected PCRE option. So, you disabled all the optimizations by introducing the redundant groupings. Look, if you disable auto-possession and use `(*NO_AUTO_POSSESS)^\s+a`, it will already [jump to 9 steps](https://regex101.com/r/ngs0Hb/2/debugger). — Wiktor Stribiżew, Dec 02 '20 at 21:52
So, it is "conceptually" equivalent? That is to say, there's no for example extra backtracking taking place in one that doesn't in the other? It's just \s+ can be consumed "in one step" vs. 7 or 20 or whatever? IOW, what would have been optimized internally into an externally visible "single action" becomes an "external" 3 steps? — Francisco Ryan Tolmasky I, Dec 02 '20 at 21:54
And to further clarify, I care about ECMA regex, but chose PCRE since its the only one that shows step count vs. just ms, which is harder to reproduce since everything is always just 0ms unless you make really long examples. — Francisco Ryan Tolmasky I, Dec 02 '20 at 21:58
ECMA regex is very different from PCRE, no need to use PCRE then. — Wiktor Stribiżew, Dec 02 '20 at 22:02

Why do atomic groups and non-capturing seem to add many steps

0 Answers0

Linked