Why does replacing \s*
(or even \s\s*
) with \s+
result in such a speedup for this input?
use Benchmark qw(:all);
$x=(" " x 100000) . "_\n";
$count = 100;
timethese($count, {
'/\s\s*\n/' => sub { $x =~ /\s\s*\n/ },
'/\s+\n/' => sub { $x =~ /\s+\n/ },
});
I noticed a slow regex s/\s*\n\s*/\n/g
in my code - when given a 450KB input file consisting of lots of spaces with a few non-spaces here and there, and a final newline at the end - the regex hung and never finished.
I intuitively replaced the regex with s/\s+\n/\n/g; s/\n\s+/\n/g;
and all was well.
But why is it so much faster? After using re Debug => "EXECUTE"
I noticed the \s+
version is somehow optimised to run in only one iteration: http://pastebin.com/0Ug6xPiQ
Matching REx "\s*\n" against " _%n"
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d ][{non-utf8-latin1-all}{unicode_all}] against " _%n" (9 bytes)
0 <> < _%n> | 1:STAR(3)
SPACE can match 7 times out of 2147483647...
failed...
1 < > < _%n> | 1:STAR(3)
SPACE can match 6 times out of 2147483647...
failed...
2 < > < _%n> | 1:STAR(3)
SPACE can match 5 times out of 2147483647...
failed...
3 < > < _%n> | 1:STAR(3)
SPACE can match 4 times out of 2147483647...
failed...
4 < > < _%n> | 1:STAR(3)
SPACE can match 3 times out of 2147483647...
failed...
5 < > < _%n> | 1:STAR(3)
SPACE can match 2 times out of 2147483647...
failed...
6 < > < _%n> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
8 < _> <%n> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
8 < _> <%n> | 3: EXACT <\n>(5)
9 < _%n> <> | 5: END(0)
Match successful!
Matching REx "\s+\n" against " _%n"
Matching stclass SPACE against " _" (8 bytes)
0 <> < _%n> | 1:PLUS(3)
SPACE can match 7 times out of 2147483647...
failed...
I know Perl 5.10+ will immediately fail the regex (without running it) if a newline is not present. I suspect it is using the location of the newline to reduce the amount of searching it does. For all cases above it seems to cleverly reduce the backtracking involved (usually /\s*\n/
against a string of spaces would take exponential-time). Can anyone offer insight into why the \s+
version is so much faster?
Also note that \s*?
does not offer any speedup.