
m/.+?(\d{4})<\/i>\)/s

The above regex is slow when I run it over some normal-size HTML pages. Why? I wouldn't have thought it would run slowly.

EDIT: Here is a code sample which demonstrates the problem:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get("http://www.elaws.gov.bw/desplaysubsidiary.php?m=SUBSIDIARY&v=I&vp=&id=904");
my $page = $mech->content();
$page =~ m/.+?(\d{4})<\/i>\)/s;

The regex line takes forever. If I remove the .+? there is no delay.
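A self-contained way to reproduce the slowdown without fetching the live page (the synthetic string below is an assumption, chosen as a worst case: it contains `</i>)` so Perl's substring pre-check cannot reject the match up front, but `\d{4}</i>)` never matches anywhere, so the engine retries at every start position):

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Synthetic worst case: '</i>)' is present, but never preceded by four digits.
my $page = ('x' x 20_000) . '</i>)';

my $t0 = time;
my $lazy_hit  = $page =~ m/.+?(\d{4})<\/i>\)/s;   # retried from every start position
my $lazy_time = time - $t0;

$t0 = time;
my $plain_hit  = $page =~ m/(\d{4})<\/i>\)/s;     # fails after one cheap scan
my $plain_time = time - $t0;

printf "with .+?: %.4fs  without .+?: %.4fs\n", $lazy_time, $plain_time;
```

On a failing match the lazy version does work roughly proportional to the square of the string length, which is why a real page of a few hundred kilobytes can take minutes.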

CJ7
  • Have you considered using a legit parser instead? – hwnd Jun 08 '16 at 05:12
  • @hwnd Fair point, but I am actually interested in why such a simple regex would run so slowly. – CJ7 Jun 08 '16 at 05:18
  • 1
    All you need is `m<(\d{4})\)>`. Is that faster? If you don't see any improvement then it's not the regex that's slow – Borodin Jun 08 '16 at 05:54
  • @Borodin Yes that does completely speed it up. I realise now that the `.+?` is unnecessary, but it shouldn't really cause a problem. – CJ7 Jun 08 '16 at 05:59
  • Not really, non-greedy is slow in a backtracking regex engine. – kennytm Jun 08 '16 at 06:45
  • If you switch on `use re 'debug';` you'll see the steps the regex engine is using. Probably quite a lot, because backtracking. – Sobrique Jun 08 '16 at 08:13
  • 2
    @Wiktor Why did you remove the perl tag and add pcre? There's no indication anywhere that the OP is using PCRE, and [Perl regex and PCRE are not equivalent](http://www.pcre.org/current/doc/html/pcre2compat.html). – ThisSuitIsBlackNot Jun 08 '16 at 14:25
  • 2
    @kennytm *"non-greedy is slow in a backtracking regex engine"* ***Non-greedy*** is fastest if the remainder of the pattern can be found *early* in the string, while ***greedy*** is fastest if the remainder of the pattern can be found *late* in the string. There is nothing inherently slow about non-greedy quantifiers – Borodin Jun 08 '16 at 15:11

2 Answers


There seems to be some misunderstanding about this

Suppose we have a string

my $s = 'xxxxxxxxxx9999</i>)';

then a pattern match like this

$s =~ m<.*?(\d{4})</i>\)>

will start by assuming that .*? takes up no characters at the start of the string. Then it will check to see whether (\d{4})</i>\) matches the string at that point

It fails, so the regex engine gives a single character x to .*? and tries again. This also fails, so the part of the string consumed by .*? is extended character-by-character until it matches the ten characters xxxxxxxxxx. At that point the remainder of the pattern matches successfully and the regex test is declared to be a success
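The walk-through above can be checked with a minimal sketch using the same string (and, as suggested in the comments, `use re 'debug';` will print each step the engine takes):

```perl
use strict;
use warnings;

my $s = 'xxxxxxxxxx9999</i>)';

# Lazy: .*? starts empty and is extended one character at a time until
# (\d{4})</i>\) matches; it ends up consuming the ten x's.
my $year;
if ($s =~ m<.*?(\d{4})</i>\)>) {
    $year = $1;
    print "captured: $year\n";   # captured: 9999
}
```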

If instead we have a non-lazy pattern

$s =~ m<.*(\d{4})</i>\)>

This will start by assuming that .* takes up all of the string

The remainder of the pattern doesn't match at that point, so backtracking commences again, giving .* all but one character of the string and trying again

This repeats, as before, but shortening the match character-by-character, until a match is found when it has retreated over the trailing nine characters of the string 9999</i>) and .* now matches xxxxxxxxxx as before

Backtracking is going back to a previously-matched pattern element when a match has been found to fail, changing how that element matches and trying again. It isn't going backwards through the object string looking for something

The problem here is caused by the .*? having to be accounted for in the pattern. If we had just m<(\d{4})</i>\)> instead, then there is no backtracking at all. The regex engine simply searches for \d{4}</i>\) and either finds it or it doesn't

This works fine as long as it's the first occurrence of a pattern that you want. Unfortunately, the only way of finding the last occurrence of a substring is to precede it with .*, which kicks off backtracking and makes the process necessarily slower
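A small sketch of that difference, using a made-up string with two candidate matches:

```perl
use strict;
use warnings;

my $s = 'aa 1111</i>) bb 2222</i>) cc';

# With no prefix, the engine simply finds the first occurrence:
my ($first) = $s =~ m<(\d{4})</i>\)>;
print "first: $first\n";            # first: 1111

# A greedy .* prefix forces the match to the last occurrence, at the
# cost of backtracking from the end of the string:
my ($last) = $s =~ m<.*(\d{4})</i>\)>;
print "last: $last\n";              # last: 2222
```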

The above regex is slow when I run it over some normal size HTML pages?

Even so, depending on what your idea of "normal size HTML pages" is, I can't see this taking more than a few milliseconds. The regex engine is coded in C and written to be very fast. I guess you must have run a timer on it to notice any delay at all?

Borodin
  • So you think that with "non-lazy" patterns it's called backtracking, and for "lazy" patterns it's called lazy pattern expansion – Borodin Jun 08 '16 at 13:18
  • @WiktorStribiżew: The processes are both called backtracking. The only difference is that backtracking into a *lazy* match makes the match longer, while backtracking into a *non-lazy* match makes it shorter. Please can you give me a reference that supports your idea? – Borodin Jun 08 '16 at 13:24
  • 1
    @WiktorStribiżew: It's very hard to demonstrate a negative, but [Regular-Expressions.info](http://www.regular-expressions.info/repeat.html) says this under the heading `Laziness Instead of Greediness` ***"Again, the engine will*** **backtrack.** ***But this time, the backtracking will force the lazy plus to expand rather than reduce its reach."*** – Borodin Jun 08 '16 at 13:26
  • 1
    @WiktorStribiżew From [perlre](http://perldoc.perl.org/perlre.html#Backtracking): "A fundamental feature of regular expression matching involves the notion called backtracking, which is currently used (when needed) by all regular non-possessive expression quantifiers, namely `"*"` , `"*?"`, `"+"`, `"+?"`, `{n,m}`, and `{n,m}?`." – ThisSuitIsBlackNot Jun 08 '16 at 14:58
  • @Borodin I am noticing a delay of several minutes. See my edit to the question for sample code demonstrating the problem. – CJ7 Jun 10 '16 at 05:04
  • I'm vaguely guessing here, but the optimizations done by the engine are probably masking out the difference more than the current explanation. With `.*` Perl knows that it should start by looking for a match on the expression right after it, so it does a straight-up linear search for that (provided it's simple and long enough) instead of the backtracking behavior you describe. – tripleee Jun 10 '16 at 05:08
  • 1
    @tripleee In both `/.*foo/` and `/.*?foo/`, the optimizer searches for the longest floating substring (`foo`). If it can't find it, the match fails immediately without invoking the full regex engine; in that case, you're correct that there would be no backtracking. However, Borodin's example input *does* contain the longest floating substring, so the best the optimizer can do is say that the match begins at position zero; it still has to invoke the full regex engine, and there is backtracking. – ThisSuitIsBlackNot Jun 10 '16 at 15:28
  • @ThisSuitIsBlackNot Thanks for stepping in with actual facts where I was too lazy. – tripleee Jun 10 '16 at 15:37

This regex is slow because you introduce laziness to . with +?. For a twenty-line HTML "hello world" page it raises the number of steps from 60 (when greedy) to 2,000 (when lazy).

Imagine what it would do with "some normal size HTML pages". You can test here (under regex debugger).

Also take a look at why you shouldn't use regex to parse HTML.

Thomas Ayoub
  • 2
    *""This regex is slow because you introduce lazyness to `.` with `+?`"* This isn't true. A "lazy" quantifier starts by matching as little as possible, and backtracks by extending that match by one character at a time until the rest of the pattern succeeds. A "non-lazy" quantifier does the opposite -- it matches as much as possible and backtracks by matching one character less until the pattern matches. Which of those is fastest depends on whether the remainder of the pattern matches the object string closer to the beginning or the end. Neither is inherently slow or fast – Borodin Jun 08 '16 at 12:05
  • *"For a 20 lines long html hello world it raises the steps from 60 (when greedy) to 2000 (when lazy)"* I'm not sure what you mean here. Are you saying there are twenty lines of 100 characters? I can't see where you get the 60 from at all – Borodin Jun 08 '16 at 15:13
  • 3
    As Borodin said, the issue is not that the quantifier is lazy. Changing to a greedy quantifier can actually be slower if there's a match early in the string. The OP probably wants simply `/(\d{4})<\/i>\)/s`. Also, the "don't parse HTML with regex" answer you linked to [doesn't really explain anything](http://meta.stackoverflow.com/q/261561); something like [this](http://htmlparsing.com/regexes) is much more useful. – ThisSuitIsBlackNot Jun 08 '16 at 15:52