56

Why does replacing \s* (or even \s\s*) with \s+ result in such a speedup for this input?

use Benchmark qw(:all);
$x=(" " x 100000) . "_\n";
$count = 100;
timethese($count, {
    '/\s\s*\n/' => sub { $x =~ /\s\s*\n/ },
    '/\s+\n/' => sub { $x =~ /\s+\n/ },
});

Link to online version

I noticed a slow regex s/\s*\n\s*/\n/g in my code - when given a 450KB input file consisting of lots of spaces with a few non-spaces here and there, and a final newline at the end - the regex hung and never finished.

I intuitively replaced the regex with s/\s+\n/\n/g; s/\n\s+/\n/g; and all was well.

But why is it so much faster? After using re Debug => "EXECUTE" I noticed the \s+ version is somehow optimised to run in only one iteration: http://pastebin.com/0Ug6xPiQ

Matching REx "\s*\n" against "       _%n"
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d ][{non-utf8-latin1-all}{unicode_all}] against "       _%n" (9 bytes)
   0 <> <       _%n>         |  1:STAR(3)
                                  SPACE can match 7 times out of 2147483647...
                                  failed...
   1 < > <      _%n>         |  1:STAR(3)
                                  SPACE can match 6 times out of 2147483647...
                                  failed...
   2 <  > <     _%n>         |  1:STAR(3)
                                  SPACE can match 5 times out of 2147483647...
                                  failed...
   3 <   > <    _%n>         |  1:STAR(3)
                                  SPACE can match 4 times out of 2147483647...
                                  failed...
   4 <    > <   _%n>         |  1:STAR(3)
                                  SPACE can match 3 times out of 2147483647...
                                  failed...
   5 <     > <  _%n>         |  1:STAR(3)
                                  SPACE can match 2 times out of 2147483647...
                                  failed...
   6 <      > < _%n>         |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   8 <       _> <%n>         |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
   8 <       _> <%n>         |  3:  EXACT <\n>(5)
   9 <       _%n> <>         |  5:  END(0)
Match successful!
Matching REx "\s+\n" against "       _%n"
Matching stclass SPACE against "       _" (8 bytes)
   0 <> <       _%n>         |  1:PLUS(3)
                                  SPACE can match 7 times out of 2147483647...
                                  failed...

I know Perl 5.10+ will immediately fail the regex (without running it) if a newline is not present. I suspect it is using the location of the newline to reduce the amount of searching it does. For all cases above it seems to cleverly reduce the backtracking involved (usually /\s*\n/ against a string of spaces would take exponential-time). Can anyone offer insight into why the \s+ version is so much faster?

Also note that \s*? does not offer any speedup.

Scott Weldon
  • 9,673
  • 6
  • 48
  • 67
rjh
  • 49,276
  • 4
  • 56
  • 63
  • 8
    It doesn't help that `\s` also matches `\n`. A whitespace character that isn't a newline is `[^\S\n]`, or you could use "horizontal whitespace" `\h`. – Borodin Jul 18 '16 at 08:42
  • You can narrow comparison to `/\s*\n/` and `/\s+\n/` [see live](http://rextester.com/DSXF83795). And note that it's only faster if the string doesn't match. In case of a match, it seems to take the same time – Thomas Ayoub Jul 18 '16 at 08:48
  • @ThomasAyoub I don't think that's narrowing the comparison. `\s\s*` should be identical to `\s+` whereas the two your posted are different regexes. However I agree that the performance difference even between the two you posted is surprising! – rjh Jul 18 '16 at 08:53
  • @Borodin `[^\S\n]*\n` is also very slow, though... – rjh Jul 18 '16 at 08:55
  • With the recent outage (http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016) this seems amusingly topical. – rjh Jul 21 '16 at 06:26

4 Answers4

28

First, even if the resulting regex will not keep the same meaning, let's reduces regexes to \s*0 and \s+0 and use (" " x 4) . "_0" as an input. For the sceptics, you can see here that the lag is still present.

Now let's consider the following code:

$x = (" " x 4) . "_ 0";
$x =~ /\s*0/; # The slow line 
$x =~ /\s+0/; # The fast line

Digging a bit with use re debugcolor; we get the following output:

Guessing start of match in sv for REx "\s*0" against "    _0"
Found floating substr "0" at offset 5...
start_shift: 0 check_at: 5 s: 0 endpos: 6 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s*0" against "    _0"
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes)
   0 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 4 times out of 2147483647...
                                  failed...
   1 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 3 times out of 2147483647...
                                  failed...
   2 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 2 times out of 2147483647...
                                  failed...
   3 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 1 times out of 2147483647...
                                  failed...
   5 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 0 times out of 2147483647...
   5 <    _0>|  3:  EXACT <0>(5)
   6 <    _0>|  5:  END(0)
Match successful!

-----------------------

Guessing start of match in sv for REx "\s+0" against "    _0"
Found floating substr "0" at offset 5...
start_shift: 1 check_at: 5 s: 0 endpos: 5 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s+0" against "    _0"
Matching stclass POSIXD[\s] against "    _" (5 bytes)
   0 <    _0>|  1:PLUS(3)
                                  POSIXD[\s] can match 4 times out of 2147483647...
                                  failed...
Contradicts stclass... [regexec_flags]
Match failed

Perl seems to be optimized for failure. It will first look for constant strings (which only consume O(N)). Here, it'll look for 0 : Found floating substr "0" at offset 5...

Then it'll start with the variable part of the regex, respectivly \s* and \s+, against the whole minimum string to check:

Matching REx "\s*0" against "    _0"
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes)
Matching REx "\s+0" against "    _0"
Matching stclass POSIXD[\s] against "    _" (5 bytes) # Only 5 bytes because there should be at least 1 "\s" char

After that it'll look for the first position meeting the stclass requirement, here at position 0.

  • \s*0:
    • starts at 0, find 4 spaces then fail;
    • starts at 1, find 3 spaces then fail;
    • starts at 2, find 2 spaces then fail;
    • starts at 3, find 1 spaces then fail;
    • starts at 4, find 0 spaces then doesn't fail;
    • Find an exact 0
  • \s+0:
    • starts at 0, find 4 spaces then fail. As the minimum number of spaces is not matched, the regex fails instantly.

If you want to have fun with Perl regex optimization, you can consider the following regexes / *\n and / * \n. At first glance, they look the same, have the same meaning... But if you run its against (" " x 40000) . "_\n" the first one will check all possibilities while the second one will look for " \n" and fail immediately.

In a vanilla, non-optimized regex engine, both regex can cause catastrophic backtracking, since they need to retry the pattern as it bumps along. However, in the example above, the second doesn't fail with Perl because it have been optimized to find floating substr "0%n"


You can see another example on Jeff Atwood's blog.

Note also that the issue is not about \s consideration but any pattern where xx* is used instead of x+ see example with 0s and also regex explosive quantifiers

With such shorter example the behavior is "findable", but if you start to play with complicated patterns, it's far from easy to spot, for example: Regular expression hangs program (100% CPU usage)

Community
  • 1
  • 1
Thomas Ayoub
  • 29,063
  • 15
  • 95
  • 142
  • In most regex engines `\s+` will also cause catastrophic backtracking in exactly the same way. So my question is how is Perl optimising the `\s+` case to be so much faster? – rjh Jul 18 '16 at 09:30
  • 8
    **Both** can cause catastrophic backtracking (using PCRE engine for example, since it doesn't have optimization for this case: [`\s\s*\n`](https://regex101.com/r/tU6sX9/1), [`\s+\n`](https://regex101.com/r/iY9jT3/1)). The difference here is most likely caused by some optimization which Perl regex is laden with. – nhahtdh Jul 18 '16 at 10:54
  • It's not immediate obvious to me why the engine needs to backtrack. Could you clarify? – jpmc26 Jul 18 '16 at 15:10
  • 2
    @jpmc26 When you write a regex, be sure to be as specific as possible. If I line-up 20 tennis ball in front of you and ask you to pick one and any number of the following you want, there is a **lot** of ways to do complete the task. If I ask you to pick one and *all* the following it limits the possibilities. See the [http://www.regular-expressions.info/catastrophic.html](Runaway Regular Expressions: Catastrophic Backtracking), it explains quite well. I hope I was clear, don't hesitate to ask more – Thomas Ayoub Jul 18 '16 at 15:25
  • 3
    You can't really compare `"0_\n" =~ /0+\n/` to `" _\n" =~ /\s+\n/`. With the former, the optimizer knows that the input must contain the string `0\n`; it doesn't, so the optimizer can bail out before running the full regex engine. With the latter, the optimizer only knows that the input must contain the string `\n`; it does, so the full regex engine has to be run. – ThisSuitIsBlackNot Jul 18 '16 at 18:29
  • @ThisSuitIsBlackNot see [example with `\s`](http://rextester.com/QIKQ91272) – Thomas Ayoub Jul 18 '16 at 18:36
  • The difference there has nothing to do with the optimizer (notice that there's no "Match rejected by optimizer"). – ThisSuitIsBlackNot Jul 18 '16 at 18:41
  • 3
    I'm kind of confused by the description of `\s` matching multiple spaces. Can you clarify that it is trying to find a single space in different positions, and that the problem could be solved by anchoring? – Bergi Jul 18 '16 at 19:07
  • 2
    @ThomasAyoub I appreciate your efforts but frankly you don't seem well-informed about this topic. Your first answer suggested that one version had catastrophic backtracking and the other didn't (incorrect). Your second edited answer was also wrong (as ThisSuitIsBlackNot points out) because one input is missing a zero, so Perl doesn't even execute the regex. I guess nobody knows precisely what the optimisation is. – rjh Jul 19 '16 at 03:53
  • @rjh I hope the edit will answer your questions, if not, good luck in your quest :) – Thomas Ayoub Jul 19 '16 at 09:41
20

When there is a "plus" node (e.g. \s+) at the beginning of a pattern and the node fails to match, the regex engine skips forward to the point of failure and tries again; with \s*, on the other hand, the engine only advances one character at a time.

Yves Orton explains this optimization nicely here:

The start class optimisation has two modes, "try every valid start position" (doevery) and "flip flop mode" (!doevery) where it trys only the first valid start position in a sequence.

Consider /(\d+)X/ and the string "123456Y", now we know that if we fail to match X after matching "123456" then we will also fail to match after "23456" (assuming no evil tricks are in place, which disable the optimisation anyway), so we know we can skip forward until the check /fails/ and only then start looking for a real match. This is flip-flop mode.

/\s+/ triggers flip-flop mode; /\s*/, /\s\s*/, and /\s\s+/ don't. This optimization can't be applied to "star" nodes like \s* because they can match zero characters, so a failure at one point in a sequence isn't indicative of failure later in the same sequence.


You can see this in the debug output for each regex. I've highlighted the skipped characters with ^. Compare this (skips four characters at a time):

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d+x/'
   ...
   0 <> <123 456 78>         |  1:PLUS(3)
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   4 <123 > <456 789 x>      |  1:PLUS(3)
      ^^^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   8 <23 456 > <789 x>       |  1:PLUS(3)
         ^^^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...

to this (skips one or two characters at a time):

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d*x/'
   ...
   0 <> <123 456 78>         |  1:STAR(3)
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   1 <1> <23 456 789>        |  1:STAR(3)
      ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
   2 <12> <3 456 789 >       |  1:STAR(3)
       ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
   4 <123 > <456 789 x>      |  1:STAR(3)
        ^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   5 <123 4> <56 789 x>      |  1:STAR(3)
          ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
   6 <23 45> <6 789 x>       |  1:STAR(3)
          ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
   8 <23 456 > <789 x>       |  1:STAR(3)
           ^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   9 <23 456 7> <89 x>       |  1:STAR(3)
             ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
  10 <23 456 78> <9 x>       |  1:STAR(3)
              ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
  12 <23 456 789 > <x>       |  1:STAR(3)
               ^^
                                  POSIXD[\d] can match 0 times out of 2147483647...
  12 <23 456 789 > <x>       |  3:  EXACT <x>(5)
  13 <23 456 789 x> <>       |  5:  END(0)

Note that the optimization isn't applied to /\s\s+/, because \s+ isn't at the beginning of the pattern. Both /\s\s+/ (logically equivalent to /\s{2,}/) and /\s\s*/ (logically equivalent to /\s+/) probably could be optimized, though; it might make sense to ask on perl5-porters whether either would be worth the effort.


In case you're interested, "flip-flop mode" is enabled by setting the PREGf_SKIP flag on a regex when it's compiled. See the code around lines 7344 and 7405 in regcomp.c and line 1585 in regexec.c in the 5.24.0 source.

Community
  • 1
  • 1
ThisSuitIsBlackNot
  • 23,492
  • 9
  • 63
  • 110
  • Thank you, this is precisely the answer I was looking for (that actually digs into the C source and explains the optimisation). Thank you so much!! – rjh Jul 21 '16 at 06:28
  • 1
    In light of http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016 this seems especially topical! – rjh Jul 21 '16 at 06:28
  • If `R+` works better than `RR`, why doesn't the regex compiler recognize and replace `RR*` with `R+`, at the AST level? – Kaz Dec 05 '16 at 20:14
  • @Kaz I address that in the last paragraph. That optimization could probably be added, but it would require effort on the part of the Perl maintainers for very little benefit. – ThisSuitIsBlackNot Dec 06 '16 at 15:32
15

The \s+\n requires that the character preceding the \n is a SPACE.

According to use re qw(debug) the compilation establishes that it needs a straight string of a known number of spaces, up to the substring \n, which is first checked for in the input. Then it checks the fixed-length space-only substring against the remaining part of input, failing as it comes to _. It's a single possibility to check, regardless of how many spaces the input has. (When there are more _\n each is found to fail equally directly, per debug output.)

Looked at it this way, it's an optimization you'd almost expect, utilizing a rather specific search pattern and lucking out with this input. Except when taken in comparison with other engines, which clearly don't do this kind of analysis.

With \s*\n this is not the case. Once the \n is found and the previous character is not a space, the search hasn't failed since \s* allows nothing (zero characters). There are no fixed-length substrings either, and it's in the backtracking game.

zdim
  • 64,580
  • 5
  • 52
  • 81
  • Does `/\s\s*?\n/` suffer from backtracking? – Zaid Jul 18 '16 at 10:26
  • 2
    @Zaid: Greedy and non-greedy patterns are *identical* with regard to backtracking. The only difference is that a greedy pattern will start off as *long* as possible and be shortened until the rest of the regex matches, while a non-greedy one will start off as *short* as possible and be extended until the rest of the regex matches. – Borodin Jul 18 '16 at 10:40
  • The output of the debugging mode suggests that the engine actually go and match `\s*` or `\s+` from left to right, instead of checking whether there is a `_` before `\n` or not. I have added additional logging to the source code and there is clearly a loop there. – nhahtdh Jul 21 '16 at 03:53
  • @nhahtdh Thank you for the comment -- i didn't realize how the above reads. I had suspected that it may check backwards and wrote that initially but then I found the same, once I looked through `re=debug` output, that it does go left to right (and changed the post). What I meant to say is that it has a fixed-length space-only string that (is traced directly and) bumps into `_`. The very presence of a fixed-length 'stick' in there is huge, it's all very fast then. (Plus, this input is very special.)I think that that's enough of an optimization. I should rephrase that sentence, thank you. – zdim Jul 21 '16 at 05:09
7

I am not sure of the internals of the regular expression engine, but it looks like it does not recognize that \s+ is in some way the same as \s\s*, since in the second one it matches a space, and then tries to match ever growing numbers of spaces, while in the first, it immediately concludes that there is no match.

The output using use re qw( Debug ); clearly shows this, using a much shorter string:

test_re.pl

#!/usr/bin/env perl
use re qw(debug);

$x=(" " x 10) . "_\n";
print '-'x50 . "\n";
$x =~ /\s+\n/;
print '-'x50 . "\n";
$x =~ /\s\s*\n/;
print '-'x50 . "\n";

Output

Compiling REx "\s+\n"
Final program:
    1: PLUS (3)
    2:   SPACE (0)
    3: EXACT <\n> (5)
    5: END (0)
floating "%n" at 1..2147483647 (checking floating) stclass SPACE plus minlen 2
Compiling REx "\s\s*\n"
Final program:
    1: SPACE (2)
    2: STAR (4)
    3:   SPACE (0)
    4: EXACT <\n> (6)
    6: END (0)
floating "%n" at 1..2147483647 (checking floating) stclass SPACE minlen 2
--------------------------------------------------
Guessing start of match in sv for REx "\s+\n" against "          _%n"
Found floating substr "%n" at offset 11...
    start_shift: 1 check_at: 11 s: 0 endpos: 11
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s+\n" against "          _%n"
Matching stclass SPACE against "          _" (11 bytes)
   0 <> <          >         |  1:PLUS(3)
                                  SPACE can match 10 times out of 2147483647...
                                  failed...
Contradicts stclass... [regexec_flags]
Match failed
--------------------------------------------------
Guessing start of match in sv for REx "\s\s*\n" against "          _%n"
Found floating substr "%n" at offset 11...
    start_shift: 1 check_at: 11 s: 0 endpos: 11
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s\s*\n" against "          _%n"
Matching stclass SPACE against "          _" (11 bytes)
   0 <> <          >         |  1:SPACE(2)
   1 < > <         _>        |  2:STAR(4)
                                  SPACE can match 9 times out of 2147483647...
                                  failed...
   1 < > <         _>        |  1:SPACE(2)
   2 <  > <        _>        |  2:STAR(4)
                                  SPACE can match 8 times out of 2147483647...
                                  failed...
   2 <  > <        _>        |  1:SPACE(2)
   3 <   > <       _%n>      |  2:STAR(4)
                                  SPACE can match 7 times out of 2147483647...
                                  failed...
   3 <   > <       _%n>      |  1:SPACE(2)
   4 <    > <      _%n>      |  2:STAR(4)
                                  SPACE can match 6 times out of 2147483647...
                                  failed...
   4 <    > <      _%n>      |  1:SPACE(2)
   5 <     > <     _%n>      |  2:STAR(4)
                                  SPACE can match 5 times out of 2147483647...
                                  failed...
   5 <     > <     _%n>      |  1:SPACE(2)
   6 <      > <    _%n>      |  2:STAR(4)
                                  SPACE can match 4 times out of 2147483647...
                                  failed...
   6 <      > <    _%n>      |  1:SPACE(2)
   7 <       > <   _%n>      |  2:STAR(4)
                                  SPACE can match 3 times out of 2147483647...
                                  failed...
   7 <       > <   _%n>      |  1:SPACE(2)
   8 <        > <  _%n>      |  2:STAR(4)
                                  SPACE can match 2 times out of 2147483647...
                                  failed...
   8 <        > <  _%n>      |  1:SPACE(2)
   9 <         > < _%n>      |  2:STAR(4)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   9 <         > < _%n>      |  1:SPACE(2)
  10 <          > <_%n>      |  2:STAR(4)
                                  SPACE can match 0 times out of 2147483647...
                                  failed...
Contradicts stclass... [regexec_flags]
Match failed
--------------------------------------------------
Freeing REx: "\s+\n"
Freeing REx: "\s\s*\n"
Scott Weldon
  • 9,673
  • 6
  • 48
  • 67
xxfelixxx
  • 6,512
  • 3
  • 31
  • 38
  • I already attached a similar debug output to my question, but thanks for looking into it. I'm interested in why it's happening :) – rjh Jul 18 '16 at 09:25