7

I am hoping some Perl gurus can opine on the following. This is the smallest possible example I could find that reproduces my problem:

>./perl -e 'print (("a".("f"x32767)."a") =~ /a(?:[^a]|bb)*a/)'
1

but

>./perl -e 'print (("a".("f"x32768)."a") =~ /a(?:[^a]|bb)*a/)'
>

and I did compile the latest Perl from source, just to see if it would fix the problem:

>./perl -v

This is perl 5, version 20, subversion 1 (v5.20.1) built for i686-linux

Is this a bug (looks like to me)?

Mark Galeck
  • 6,155
  • 1
  • 28
  • 55
  • Note that your pattern is equivalent to `a[^a]*a` – ikegami Oct 07 '14 at 00:29
  • I think you can work around it using something like changing `X*` to `(?:X{32000}){0,32000}X{0,32000}`. – ikegami Oct 07 '14 at 00:33
  • @ikegami yes indeed this pattern in the SSCCE is equivalent. but not in the real-world example that I really need. Same for second question, the string being matched is not really that simple. – Mark Galeck Oct 07 '14 at 05:07

2 Answers2

12

This is a known bug reported since 2002 and it has yet to be fixed. You now know you are not the first person to encounter this bug (or feature, as you will see soon).

From this comment in the bug report, it seems that quantifiers (*, +, {n,m}, {n,}) are designed to have an upper limit on the number of repetitions, which prevents the engine from segfault when the stack used for backtracking overflows, but violates the very definition of Kleene operator in regular expression (repeat the pattern for arbitrary number of times) and gives wrong answer for the query1.

1 In contrast, Java's regex engine (Oracle's implemetation) simply allow StackOverflowError to occur for cases like this, but the quantifier has the upper limit of 232 - 1, which is sufficient for most use case. And there exists a workaround for cases like this, which is to use possessive quantifier.

The same comment also print the regex compilation debugging info, and the output clearly shows that * is translated to {0,32767}. It is also reproducible on my machine (perl v5.10.1 (*) built for x86_64-linux-thread-multi).

$ perl -Mre=debug -wce '/(A|B)*/'
Compiling REx "(A|B)*"
Final program:
   1: CURLYM[1] {0,32767} (15)
   5:   TRIE-EXACT[AB] (13)
        <A>
        <B>
  13:   SUCCEED (0)
  14: NOTHING (15)
  15: END (0)
minlen 0
-e syntax OK
Freeing REx: "(A|B)*"

This following test further confirms the problem, and it shows that perl doesn't let you specify a repetition that exceeds the limit.

$ perl -e 'print (("a".("f"x32767)."a") =~ /a(?:[^a]|bb){0,32767}a/)'
Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/a(?:[^a]|bb){ <-- HERE 0,32767}a/ at -e line 1.

Making the quantifier possessive *+ does not solve the problem, since the limit is still there:

$ perl -Mre=debug -wce '/(A|B)*+/'
Compiling REx "(A|B)*+"
Final program:
   1: SUSPEND (19)
   3:   CURLYM[1] {0,32767} (17)
   7:     TRIE-EXACT[AB] (15)
          <A>
          <B>
  15:     SUCCEED (0)
  16:   NOTHING (17)
  17:   SUCCEED (0)
  18: TAIL (19)
  19: END (0)
minlen 0
-e syntax OK
Freeing REx: "(A|B)*+"
nhahtdh
  • 55,989
  • 15
  • 126
  • 162
11

Add use warnings:

use strict;
use warnings;
use feature qw(say);

say "Version is $^V";

say +("a".("f"x32767)."a") =~ /a(?:[^a]|bb)*a/ ? 'matches' : 'no match';

say +("a".("f"x32768)."a") =~ /a(?:[^a]|bb)*a/ ? 'matches' : 'no match';

Outputs:

Complex regular subexpression recursion limit (32766) exceeded at e.pl line 7.
Complex regular subexpression recursion limit (32766) exceeded at e.pl line 9.
Version is v5.20.1
matches
no match

From perldiag

Complex regular subexpression recursion limit (%d) exceeded

(W regexp) The regular expression engine uses recursion in complex situations where back-tracking is required. Recursion depth is limited to 32766, or perhaps less in architectures where the stack cannot grow arbitrarily. ("Simple" and "medium" situations are handled without recursion and are not subject to a limit.) Try shortening the string under examination; looping in Perl code (e.g. with while ) rather than in the regular expression engine; or rewriting the regular expression so that it is simpler or backtracks less. (See perlfaq2 for information on Mastering Regular Expressions.)

Community
  • 1
  • 1
Miller
  • 34,962
  • 4
  • 39
  • 60
  • I asked another question on why "back-tracking is required" as you quote from the docs. It is _not_ _required_ here. It _might_ be needed, when we are going through the greedy quantifier, so during that time, we "record" the backtracking information and that is what causes the bug (even though there is no actual backtracking). Is that right? Then why is it not written in the docs you quote?? It only talks about "required". As far as I can see, this documentation is just terrible... – Mark Galeck Oct 07 '14 at 19:32