There seems to be some misunderstanding about what backtracking actually is.
Suppose we have a string
my $s = 'xxxxxxxxxx9999</i>)';
then a pattern match like this
$s =~ m<.*?(\d{4})</i>\)>
will start by assuming that .*?
takes up no characters at the start of the string. Then it will check to see whether (\d{4})</i>\)
matches the string at that point.
It fails, so the regex engine gives a single character x
to .*?
and tries again. This also fails, so the part of the string consumed by .*?
is extended character-by-character until it matches the ten characters xxxxxxxxxx
. At that point the remainder of the pattern matches and the whole match succeeds.
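To see what the lazy quantifier ends up consuming, here is a minimal sketch. It assumes an extra pair of capturing parentheses around .*? (not in the original pattern), added only so the consumed prefix can be printed.

```perl
use strict;
use warnings;

my $s = 'xxxxxxxxxx9999</i>)';

# Same pattern as above, except that .*? is wrapped in its own capture
# group (an addition for illustration) so we can inspect what it took.
my ($prefix, $digits) = $s =~ m<(.*?)(\d{4})</i>\)>;

printf "prefix=%s (length %d), digits=%s\n",
    $prefix, length $prefix, $digits;
# prints: prefix=xxxxxxxxxx (length 10), digits=9999
```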
If instead we have a greedy (non-lazy) pattern
$s =~ m<.*(\d{4})</i>\)>
then the engine will start by assuming that .*
takes up all of the string.
The remainder of the pattern doesn't match at that point, so backtracking commences again, giving .*
all but one character of the string and trying again.
This repeats, as before, but shortening the match character-by-character, until a match is found when it has retreated over the trailing nine characters of the string 9999</i>)
and .*
now matches xxxxxxxxxx
as before.
Backtracking is going back to a previously-matched pattern element when a match has been found to fail, changing how that element matches and trying again. It isn't going backwards through the object string looking for something.
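The two strategies can be modelled by hand, which may make the difference easier to see. This is only a rough sketch of the idea, not how the regex engine is implemented (the real engine applies optimisations of its own). The simulate helper and the anchored $rest pattern are made up for this illustration.

```perl
use strict;
use warnings;

my $s    = 'xxxxxxxxxx9999</i>)';
my $rest = qr<\A(\d{4})</i>\)>;   # remainder of the pattern, anchored

# Hypothetical helper: give the leading wildcard each candidate length
# in turn and test whether the remainder matches right after it.
# Returns the first length that works and the number of attempts made.
sub simulate {
    my ($string, @lengths) = @_;
    my $attempts = 0;
    for my $n (@lengths) {
        $attempts++;
        return ($n, $attempts) if substr($string, $n) =~ $rest;
    }
    return (undef, $attempts);
}

my $len = length $s;   # 19 characters

# Lazy .*? starts with nothing and grows; greedy .* starts with
# everything and shrinks.
my ($lazy_n,   $lazy_tries)   = simulate($s, 0 .. $len);
my ($greedy_n, $greedy_tries) = simulate($s, reverse 0 .. $len);

print "lazy:   prefix length $lazy_n after $lazy_tries attempts\n";
print "greedy: prefix length $greedy_n after $greedy_tries attempts\n";
# prints:
# lazy:   prefix length 10 after 11 attempts
# greedy: prefix length 10 after 10 attempts
```

Both orders settle on the same ten-character prefix for this string. Only the direction of the search, and so the number of attempts, differs.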
The problem here is caused by the .*?
having to be accounted for in the pattern. If we had just m<(\d{4})</i>\)>
instead, then there would be no backtracking at all. The regex engine would simply search for \d{4}</i>\)
and either find it or not.
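As a small illustration of the pattern without a leading wildcard, the sketch below records where the match starts and ends using the built-in @- and @+ arrays. The variable names are just for this example.

```perl
use strict;
use warnings;

my $s = 'xxxxxxxxxx9999</i>)';

my ($start, $end, $digits);
if ($s =~ m<(\d{4})</i>\)>) {
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    ($start, $end, $digits) = ($-[0], $+[0], $1);
    print "matched offsets $start to $end, captured $digits\n";
}
# prints: matched offsets 10 to 19, captured 9999
```

The match simply begins at offset 10, where the digits are. Nothing in the pattern has to account for the leading x characters.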
This works fine as long as it's the first occurrence of a pattern that you want. Unfortunately, the usual way of finding the last occurrence with a single pattern match is to precede it with .*
, which kicks off backtracking and makes the process necessarily slower.
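The first-versus-last distinction only shows up when the target occurs more than once. Here is a short sketch using a made-up string with three occurrences (the string $t is an assumption for illustration, not taken from the question):

```perl
use strict;
use warnings;

# Invented test string containing three candidate matches.
my $t = 'a1111</i>) b2222</i>) c3333</i>)';

my ($plain)  = $t =~ m<(\d{4})</i>\)>;      # no leading wildcard
my ($lazy)   = $t =~ m<.*?(\d{4})</i>\)>;   # lazy prefix
my ($greedy) = $t =~ m<.*(\d{4})</i>\)>;    # greedy prefix

print "no prefix:  $plain\n";
print "lazy .*?:   $lazy\n";
print "greedy .*:  $greedy\n";
# prints:
# no prefix:  1111
# lazy .*?:   1111
# greedy .*:  3333
```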
The above regex is slow when I run it over some normal size HTML pages?
Even so, depending on what your idea of "normal size HTML pages" is, I can't see this taking more than a few milliseconds. The regex engine is coded in C and written to be very fast. I guess you must have run a timer on it to notice any delay at all?
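If you want to measure it rather than guess, something like the following would do. It is only a sketch: the page contents are invented filler of roughly 115 KB standing in for a real HTML page, and the timing you see will depend on your machine and your actual data.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Made-up stand-in for an HTML page: repeated filler, then the target.
my $page = ('<p>some filler text</p>' x 5_000) . '(<i>9999</i>)';

my $t0      = time();
my ($year)  = $page =~ m<.*?(\d{4})</i>\)>;
my $elapsed = time() - $t0;

printf "captured %s from %d bytes in %.6f seconds\n",
    $year, length $page, $elapsed;
```

For anything more careful than a one-off check, the core Benchmark module can run the match many times and average the result.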