3

I have been trying to write a regular expression to validate a file to make sure that it follows a specific format. The file should have a version(); line and then be followed by one or more element(); blocks.

Here is an example of a valid file:

version(1.0);

element
(
);

element
(
);

element
(
);

As a test I created the following Perl example:

use strict;
use warnings;

my $text = <<'END_TEXT';
version(1.0);

element
(
);

garbage <--- THIS SHOULD NOT MATCH!

element
(
);

element
(
);

END_TEXT

my $rx_defs = qr{(?(DEFINE)
    (?<valid_text>
        \A\s*(?&version)\s*
        (?: (?&element) \s* )+
        \s*\Z
    )
    (?<version>
        version\(.+?\);
    )
    (?<element>
        element\s*
        (?&element_body);
    )
    (?<element_body>
        \( (?: [^()]++ | (?&element_body) )* \)
    )
)}xms;

if ($text =~ m/(?&valid_text)$rx_defs/) {
    print "match";
}

As you can see, there is a line of "garbage" in the text that should make it not valid, but for some reason Perl still seems to think that this text is valid! When I run this code it produces the output:

match

I have spent hours trying to track down what is wrong with my regular expression, but I just don't see it. I even tested the exact regex using an online regular expression tester and according to the test my regular expression should be working fine! (Try removing the line of "garbage" if you want to see that it does match correctly when the format is valid.)

This has had me thoroughly stumped all day and is making me wonder whether there is a bug in the Perl regular expression engine itself. Can somebody please tell me why this is matching when it shouldn't?

I am using perl v5.20.1

tjwrona1992
  • 8,614
  • 8
  • 35
  • 98
  • 1
    This bit looks dubious: `version\(.+?\);` – the parens could include an element and the garbage. Consider restricting the contents, e.g. to `[^)]+`. – amon Sep 21 '17 at 21:32
  • That seems to fix it... but how??? a non-greedy match should stop as soon as it is satisfied! – tjwrona1992 Sep 21 '17 at 21:34
  • Also why would this match in perl, but not on the regex testing website using "PCRE" as the regex type? They should function identically shouldn't they? – tjwrona1992 Sep 21 '17 at 21:36
  • A very cool regex, no question, and regex is a great tool, no question ... but I can't help it but saying: this is hard while it can be avoided, by using tools that handle those nested/balanced delimiters. (The exact point of [this answer](https://stackoverflow.com/a/46121634/4653379) in your previous question.) – zdim Sep 21 '17 at 21:36
  • 3
    Tip: It's never an engine/language/compiler bug; it's always a user bug. Well ... unless ... but that one's the last thing on a long, long list of possible causes of behavior. (I'd re-consider a title like that; are you really sure?) – zdim Sep 21 '17 at 21:41
  • Well either perl is broken or regex101.com is broken because they disagree. – tjwrona1992 Sep 21 '17 at 21:43
  • Consider the regex `/a(.*?)xyz/`. The non-greedy match in the capture does not mean the matching will stop at the first `'x'`, it means it will stop at the first `'x'` followed by `'yz'`. Similarly, in your regex `\(.+?\)` the `.+?` can include matched `)` characters, depending on what the regex says must follow. – Grant McLean Sep 21 '17 at 21:44
  • PCRE and Perl have entirely different regex engines. They try to have fairly compatible syntax, but the more exotic features you use, the more differences you will see. Here, exact backtracking behaviour seems to be one difference. – amon Sep 21 '17 at 21:49

2 Answers2

4

From the PCRE documentation at http://www.pcre.org/current/doc/html/pcre2compat.html:

  1. Subroutine calls (whether recursive or not) were treated as atomic groups up to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking into subroutine calls is now supported, as in Perl.

regex101 uses PHP to run PCRE. According to http://php.net/manual/en/pcre.installation.php, PHP only supports PCRE1 (the 8.x branch). Thus regex101 doesn't support backtracking into subroutine calls.

... which is exactly what happens here:

  • we go into (?&valid_text>) and try to match \A\s*(?&version)\s*
  • \A (beginning-of-string) and \s* (optional whitespace) are simple
  • (?&version) does version\(.+?\);
  • this matches the following part of the input:

    version();
    
    element
    (
    );
    

    version( is matched literally. The next character, ), is consumed by .+? (which requires at least one character to match). Then .+? slowly consumes more and more characters (it's non-greedy) until it reaches );. The first time this happens is after consuming ; element (, so that's where we stop for now.

  • the (?&version) call returns
  • we consume any following whitespace
  • the next part is (?: (?&element) \s* )+, i.e. one or more element, each followed by optional whitespace
  • (?&element) does element\s*, i.e. it must start with element
  • our current position in the input is garbage ..., so this fails

At this point the regex engine tries to backtrack. In PCRE < 10.30, the only parts that can backtrack are \s* (i.e. the "optional whitespace" bits), but matching fewer whitespace characters doesn't lead to a successful match either, so the whole thing fails quickly.

However, in Perl we can backtrack into subroutine calls: We re-enter (?&version) and let .+? match more characters (until the next occurrence of ); is found), then retry (?&element). This eventually lets (?&version) consume garbage and the following element, which in turn allows the whole regex to succeed.

Can somebody please tell me why this is matching when it shouldn't?

I don't understand why you think it shouldn't match. :-)

The only reason it doesn't match in PHP is a limitation in the older PCRE version it uses.

melpomene
  • 84,125
  • 8
  • 85
  • 148
  • The real text that I was trying to parse has text between the parenthesis in `version()`. It appears I may have oversimplified things a bit with my example, but even with other text for that `.+?` to gobble up it still matches. Which I guess is the correct behavior although it is very confusing. – tjwrona1992 Sep 21 '17 at 22:21
  • 1
    @tjwrona1992 Any time you have `.*` or `.+` in a regex, it's a potential bug. `.*` can and will skip over any text if that's what it takes to make the whole regex succeed (unless it's limited by a `(?> )` group or a `(*PRUNE)` verb or similar). NB: Greediness doesn't affect whether or where a regex matches; it only affects the length of the match. – melpomene Sep 21 '17 at 22:23
  • Makes sense, I was never aware of that so that's really good to know. I always thought making it non-greedy would make it stop there. – tjwrona1992 Sep 21 '17 at 22:23
3

A non-greedy match doesn't stop as soon as it's satisfied. It attempts continuing as soon as possible. If the rest of the regex fails to match, backtracking will still occur – but for a non-greedy quantifier, backtracking means matching more.

One possibility to avoid this lies in backtracking control. For example, you might want to disallow backtracking once the version has initially matched. We can do this via the (?> ...) construct. This matches the contained pattern independently from the outside patterns. If the rest of the pattern fails, backtracking will not continue into the contained pattern, but will skip over the whole contained pattern. Describing this is a bit difficult, please see perldoc perlre for details.

Adding + to a quantifier (like ++, ?+, *+) has a similar effect to (?> ...). Preferring these no-backtracking quantifiers and (?>...) groups is highly advisable in efficient regexes.

Specifically, replace

(?<valid_text>
    \A\s*(?&version)\s*
    (?: (?&element) \s* )+
    \s*\Z
)

with

(?<valid_text>
    \A\s*(?>(?&version))\s*
    (?: (?&element) \s* )++
    \s*\Z
)

As another alternative, you can use the (*PRUNE) backtracking control verb. Once a PRUNE command is encountered, no backtracking past that point can occur. This commits the match to the alternatives chosen so far.

(?<valid_text>
    \A\s*(?&version)\s* (*PRUNE)
    (?: (?&element) \s* )+
    \s*\Z
)
amon
  • 57,091
  • 2
  • 89
  • 149