
I've been working with WordPress for a while now, and I came across a regex inside shortcode_unautop that I don't understand.

This is the regex from shortcode_unautop:

/<p>\s*+(\[(?:$tagregexp)(?![\w-])[^\]\/]*(?:\/(?!\])[^\]\/]*)*?(?:\/\]|\](?:[^\[]*+(?:\[(?!\/\2\])[^\[]*+)*+\[\/\2\])?))\s*+<\/p>/s

In it, leading whitespace is matched with \s*+. What does the *+ (an asterisk followed by a plus sign) mean inside this regex? FYI, the site www.regexr.com says that *+ is an invalid modifier.

Thank you!

1 Answer


In PCRE, a + placed directly after another quantifier (*, +, ? or {m,n}) modifies that quantifier so that it matches possessively.

*+ is a possessive quantifier meaning 0 or more, without backtracking.

Backtracking is one of the basic processes in regex matching. Let's say you have the string abcbaba and use the regex .*bc.

The engine will move following the arrow, first with .*:

 a b c b a b a
^
 a b c b a b a
  ^
 a b c b a b a
    ^ 
 a b c b a b a
      ^
 a b c b a b a
        ^
 a b c b a b a
          ^
 a b c b a b a
            ^
 a b c b a b a
              ^

At this point, .* cannot match any more, so the engine backtracks one character at a time, trying to match the b in the regex.

 a b c b a b a
            ^

No b, continue:

 a b c b a b a
          ^

There, b matches, so it tries to match c, but cannot find one. It will backtrack again and a couple of steps later...

 a b c b a b a
  ^

So .* ends up matching only the a, and bc matches right after it.
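To check the walkthrough above, here is a small sketch in Python rather than PHP. The capture group around .* is my addition, there only to show what .* keeps once backtracking is done:

```python
import re

text = "abcbaba"

# Greedy .* first consumes the whole string, then gives characters back
# one at a time until the rest of the pattern (bc) can match.
m = re.search(r"(.*)bc", text)

greedy_prefix = m.group(1)  # what .* kept after backtracking
greedy_match = m.group(0)   # the full match

print(repr(greedy_prefix), repr(greedy_match))
```

Python's re module is a backtracking engine like PCRE, so it should behave the same way for this pattern.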

With .*+, you get the .* to match everything like in the first case...

 a b c b a b a
              ^

But then it cannot match any more, and it is forbidden from backtracking, so the match fails.

Sometimes you want backtracking, but at other times you don't and it's only a nuisance. That's why possessive quantifiers and atomic groups exist: to speed things up.

Jerry