8

I have a huge file aab.txt whose contents are aaa...aab.

To my great surprise

perl -ne '/a*bb/' < aab.txt

runs (match failure) faster than

perl -ne '/a*b/' < aab.txt

(match success). Why???? Both should first gobble up all the a's, then the second one immediately succeeds, while the first will then have to backtrack over and over again, to fail.

Mark Galeck
  • 6,155
  • 1
  • 28
  • 55

2 Answers2

8

Perl regexes are optimized to rather fail as early as possible, than to succeed as fast as possible. This makes a lot of sense when grepping through a large log file.

There is an optimization that first looks for a constant part of the string, in this case, a “floating” b or bb. This can be checked rather efficiently without having to keep track of backtracking state. No bb is found, and the match aborts right there.

Not so with b. That floating substring is found, and the match constructed from there. Here is the debug output of the regex match (program is "aaab" =~ /a*b/):

Compiling REx "a*b"
synthetic stclass "ANYOF_SYNTHETIC[ab][]".
Final program:
   1: STAR (4)
   2:   EXACT <a> (0)
   4: EXACT <b> (6)
   6: END (0)
floating "b" at 0..2147483647 (checking floating) stclass ANYOF_SYNTHETIC[ab][] minlen 1 
Guessing start of match in sv for REx "a*b" against "aaab"
Found floating substr "b" at offset 3...
start_shift: 0 check_at: 3 s: 0 endpos: 4 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "a*b" against "aaab"
Matching stclass ANYOF_SYNTHETIC[ab][] against "aaab" (4 bytes)
   0 <> <aaab>               |  1:STAR(4)
                                  EXACT <a> can match 3 times out of 2147483647...
   3 <aaa> <b>               |  4:  EXACT <b>(6)
   4 <aaab> <>               |  6:  END(0)
Match successful!
Freeing REx: "a*b"

You can get such output with the debug option for the re pragma.

Finding the b or bb is unnecessary, strictly speaking, but it allows the match to fail much earlier.

amon
  • 57,091
  • 2
  • 89
  • 149
6
/a*bb/

is basically

/^(?s:.*?)a*bb/

Note the two *. Optimizations aside, it's quadratic. In the worst case scenario, (a string of all a), for a string of length N, it will check if the current character is an a N*(N-1)/2 times. We call this O(N2).

It's worth doing a scan of the string (O(N)) to see if it can possibly match before starting the match. It will take a little longer to match, but it will fail to match much faster. This is what Perl does.

When you run the following

perl -Mre=debug -e"'aaaaab' =~ /a*bb/"
  1. You get information about the compilation of the pattern:

    Compiling REx "a*bb"
    synthetic stclass "ANYOF{i}[ab][{non-utf8-latin1-all}]".
    Final program:
       1: STAR (4)
       2:   EXACT <a> (0)
       4: EXACT <bb> (6)
       6: END (0)
    floating "bb" at 0..2147483647 (checking floating) stclass ANYOF{i}[ab][{non-utf8-latin1-all}] minlen 2
    

    The last line indicates it will search for bb in the input before starting to match.

  2. You get information about the evaluation of the pattern:

    Guessing start of match in sv for REx "a*bb" against "aaaaab"
    Did not find floating substr "bb"...
    Match rejected by optimizer
    

    Here you see that check in action.

ikegami
  • 367,544
  • 15
  • 269
  • 518