Perl regex wierd behavior : works with smaller strings fails with longer repetitions of the smaller ones

Question

here is a REGEX in perl that I use to identify strings that match this pattern : include any number of occurrences of any character but single quote ' or backslash , allow only escaped occurrences of ' or , respectively : \' and \ and finally it has to end with a (non-escaped) single quote '

foo.pl

#!/usr/bin/perl
my $line;
my $matchString;
Main();
sub Main() {
        foreach $line( <STDIN> ) {
                $line =~ m/(^(([^\\\']*?(\\\')*?(\\\\)*?)*?\'))/g;
                $matchString = $1;
                print "matchString:$matchString\n"
        }
}

It seems to work fine for strings like :

    ./foo.pl
asasas'
sdsdsdsdsdsd'
\\\'sdsdsdsdsd\\\'sdsdsdsd\\'
\'sddsd\\sdsdsds\\\\\\sdsdsdsd\\\\\\'
matchString:asasas'
matchString:sdsdsdsdsdsd'
matchString:\\\'sdsdsdsdsd\\\'sdsdsdsd\\'
matchString:\'sddsd\\sdsdsds\\\\\\sdsdsdsd\\\\\\'

Then I create a file with the following recurring pattern :

AAAAAAAAAAAAAAAAAAAAAAAAAAAAA\\BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\'CCCCCCCCCCCCCCCCCCCCCC\\sdsdsd\\\\\'  ZZZZ\'GGGGGG

By creating a string by repeating this pattern one or more times and adding a single quote ' at the end should match the reg exp. I created a file called zz3 with 16 repetitions of the above pattern. I created then a file called ZZ6 with 18 repetitions of zz3 and another one called ZZ7 with the contents of ZZ6 + one additional instance of zz3, hence 19 repetitions of zz3. By adding a single quote at the end of zz3 it results in a match. By adding a single quote at the end of ZZ6 it also results in a match as expected.
Now here is the tough part, by adding a single quote at the end of ZZ7 does not result in a match!

here is a link to the 3 files : https://drive.google.com/file/d/0BzIKyGguqkWvOWdKaElGRjhGdjg/view?usp=sharing

The perl version I am using is v5.16.3 on FreeBSD bit i tried with various versions on either FreeBSD or linux with identical results. It seems to me that either perl has a problem with the size from 34274 bytes (ZZ6) to 36178 bytes (ZZ7), or I am missing something badly.

This sounds like an XY problem. What are you actually trying to accomplish? — Sobrique, Jun 25 '15 at 10:09
Welcome to Stack Overflow. It's good that you tried to be very explicit, but your question is actually a bit confusing. All that talk about file names and stuff is not relevant. Also external files are discouraged. Please try to boil down your example data to **very short** stuff. One `A` should be enough, it doesn't have to be `AAAAA...`. Then please [edit] your question and show that example data. One that works, one that doesn't. Also, you should `use strict` and `use warnings` and declare your variables lexically *inside* the `sub`. — simbabque, Jun 25 '15 at 10:10
Thanx about the hint to use warnings. I could not replicate the faulty behavior with smaller files, hence all the burden with the links. The "Complex regular subexpression recursion limit (32766) exceeded" msg just won't come up on "good" files of smaller size than ZZ7's. I am not quite sure that perl works correct with those data. — panix Panixgr, Jun 25 '15 at 10:51
@simbabque boiling down to "very short" stuff would not manifest the problem in the first place. So "one A" would not be enough, and also since I could not paste here files of 32K, the use of external files was the only option to show the problem. — panix Panixgr, Jun 26 '15 at 06:30

score 1 · Answer 1 · answered Jun 25 '15 at 10:13

1

Your regular expression leads to catastrophic backtracking because you have nested quantifiers.

If you change it to

(^(([^\\\']*+(\\')*+(\\\\)*+)*?'))

(using possessive quantifiers to avoid backtracking), it should work.

answered Jun 25 '15 at 10:13

Tim Pietzcker

328,213
58
503
561

Thanx, should perl really behave this way? After simbabque's hint about use warnings, I got this message "Complex regular subexpression recursion limit (32766) exceeded" : for the following types of files : files that correctly do not match (zz3, ZZ6 without the last quote), or files that match but are large enough to demonstrate this (ZZ7). – panix Panixgr Jun 25 '15 at 10:56
Did you read the article about catastrophic backtracking? Your regex is constructed in a way that forces the regex engine to check an astronomical number of permutations. Some regex engines error out after a certain threshold is reached, some are determined to go all the way. – Tim Pietzcker Jun 25 '15 at 10:59
Yup I did,nowehere is it documented that perl should fail to match this. I will certainly try your regex with the possessive quantifiers in the bigger picture and come back with results. Shoulnd't it be : (^(([^\\\']*+(\\\')*+(\\\\)*+)*?\') btw? – panix Panixgr Jun 25 '15 at 11:04
Escaping the apostrophe is unnecessary, if that's what you mean. – Tim Pietzcker Jun 25 '15 at 11:05
In my real project, $dataField =~ m/(^(([^\\']*+(\\')*+(\\\\)*+)*?'))/s; seems to suffer as well : prints : "Complex regular subexpression recursion limit (32766) exceeded", as well, and fails to match. – panix Panixgr Jun 25 '15 at 14:17
this (^(([^\\\']|(\\\')|(\\\\))*?\')) (non nested quantifiers) also gives the above message and fails to match. – panix Panixgr Jun 25 '15 at 14:34
See here : http://stackoverflow.com/questions/26226630/latest-perl-wont-match-certain-regexes-more-than-32768-characters-long . Perl just won't allow any repetition quantifier (greedy or not, possesive or not, with backtracking or not) to be repeated more than 32766 chars long. – panix Panixgr Jun 26 '15 at 06:23

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

I just would like to note that the whole problem appeared in an effort to re-engineer an old in-house program to parse escaped PostgreSQL bytea values.

Following this discussion it is clear that perl cannot match any repetition of non dot (.) patterns for more than 32766(=32K-2) times.

The solution is to masquerade the \\ and \' sequences with some chars that are certain to not appear in the input, such as Device Ctrl1 (\x11) and Device Ctrl2 (\x12), (presented as ^Q, ^R in vi respectively) :

$dataField =~ s/\\\\/\x11/g;
$dataField =~ s/\\\'/\x12/g;

then try to match non greedily any input till the first single quote.

$dataField =~ m/(^.*?\')/s;
$matchString = $1;

and finally substitute the above Ctrl chars back to their initial values

$matchString =~ s/\x11/\\\\/g;
$matchString =~ s/\x12/\\\'/g;

This is very fast. Another solution would be to parse till the first single quote and count the number of \'s. If it is even then we have found our last non escaped single quote in the text so we have found our desired match, otherwise the single quote is an escape one and thus considered part of the text, so we keep this value and iterate to the next single quote and repeat the same logic, by concatenating the value to the previous value. This tends to be very slow for big files with many intermediate escaped single quotes. Perl regex's seem to be much faster than Perl code.

Perl regex wierd behavior : works with smaller strings fails with longer repetitions of the smaller ones

2 Answers2