0

My user gives a regexp with quantifiers that default to being greedy. He can give any valid regexp. So the solution will have to deal with anything that the user can throw at me.

How do I convert the regexp so any greedy quantifier will be non-greedy?

Does Perl have a (?...:regexp) construct that forces the greedy default for quantifiers into a non-greedy one?

If not: Is there a different way I can force a regexp with greedy quantifiers into a non-greedy one?

E.g., a user may enter:

.*
[.*]
[.*]{4,10}
[.*{4,10}]{4,10}

While these four examples may look similar, they have completely different meanings.

If you simply add ? after every */} you will change the character sets in the last three examples.

Instead they should be changed to/behave like:

.*?
[.*]
[.*]{4,10}?
[.*{4,10}]{4,10}?

but where the matched string is the minimal match, and not first-match, that Perl will default to:

$a="aab";

$a=~/(a.*?b)$/;
# Matches aab, not ab
print $1;

But given the non-greedy regexp, the minimal match can probably be obtained by prepending .*:

$a="aab";

$a=~/.*(a.*?b)$/;
# Matches ab
print $1;
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Ole Tange
  • 31,768
  • 5
  • 86
  • 104
  • 1
    Maybe you should ask about a specific regex, or quantifier. – TLP May 01 '21 at 10:12
  • 1
    Didn't look hard, did you? Second result for "greed" in perlre answers your question – ikegami May 01 '21 at 10:44
  • 1
    @TLP The problem is that the regexp is given by *the* *user*! I cannot control what they enter, so I need a way to do this with *any* regexp. E.g. they may very well enter: '[(?:.*)]' which does not contain any quantifiers, but looks that way - instead it is a character set. – Ole Tange May 01 '21 at 12:06
  • 2
    @ikegami It does *not* answer the question. Re-read the question if you do not understand why. Your reference lists the non-greedy quantifiers. It does not show a general way of converting at greedy regexp to a non-greedy regexp. The naive approach of adding a `?` to every `*` will not work as shown in the question. – Ole Tange May 01 '21 at 12:24
  • 2
    _"construct that forces a greedy regexp into a non-greedy one"_ Does that even exist in _any_ regex flavor? I've never heard of that. Also, as TLP said, this is 99% an XY problem. What exactly do you gain from changing the user's pattern to match something different from what they are looking to match? – 41686d6564 stands w. Palestine May 01 '21 at 13:20
  • 1
    @ikegami "And noone suggested adding ? after every `*`/`}`" So what *did* you suggest by pointing to the reference? Please do not weasel around this and instead quote what you believe would answer the question, and I will explain why it does not answer the question. – Ole Tange May 01 '21 at 22:20
  • @TLP https://savannah.gnu.org/bugs/?60400 I am not sure you will understand it from that, so that is why I tried to sum it up. – Ole Tange May 01 '21 at 22:21
  • Re "*So what did you suggest by pointing to the reference?*", The docs show that a quantifier is made non-greedy by appending a `?`. It's elementary to conclude that the solution is to add a `?` after each quantifier that doesn't already have a `?` modifier. This is what the reference suggest, and therefore what suggested. – ikegami May 10 '21 at 00:32
  • Using non-greediness for purposes other than optimization is a trap anyway. It doesn't do what you want. You think `a.*?b` will prevent `.*` from matching any `a` or `b`? Think again! – ikegami May 10 '21 at 00:37
  • @ikegami Great. So we agree that you did not come with a solution. Not even close. As you can see from the answers even just appending `?` to every quantifier is a non-trivial problem. So if we want to convert `a.*b$` into a non-greedy regexp where `.*` does not match any `a` or `b` this will make the solution even harder. I think you provided the best evidence for why your original comment was completely wrong. – Ole Tange May 10 '21 at 06:57
  • Re "*So we agree that you did not come with a solution.*", I answered the question. /// Re "*just appending ? to every quantifier is a non-trivial problem.*", Correct. And as such, asking to code to do this is not a valid question. Asking us to write a regex parser is way too broad, and asking for library/tools recommendations is also off-topic. Specifically, SO is not a code writing service. I showed what you had to do, but coding it is up to you. If you have more specific questions, feel free to ask those. – ikegami May 10 '21 at 08:38
  • In your last example there's ambiguity. Why couldn't the "minimal" version be `[.*{4,10}?]{4,10}` or `[.*{4,10}?]{4,10}?` – Zaid May 10 '21 at 14:33
  • @Zaid If the input is 0000000000 the first will match all 10, and the second will only match 4. I am looking for the last. – Ole Tange May 10 '21 at 16:21

2 Answers2

7

"Greedyness" is not a property of the whole regular expression. It's a property of a quantifier.

It can be controlled for each quantifier separately. Just add a ? after a quantifier to make it non-greedy, e.g.

[a-z]*?

a{2,3}?

[0-9]??

\s+?

And no, there isn't any built-in way to turn the whole regex to some "default-non-greedy" mode. You need to parse the regex, detect all quantifiers and change them accordingly. Maybe there's a regex-parsing library somewhere on CPAN.


The closest I've found so far is the Regexp::Parser module. I didn't try it, but looks like it could parse the regex, walk the tree, make appropriate changes and then build a modified regex. Please take a look.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Alex Shesterov
  • 26,085
  • 12
  • 82
  • 103
  • A function that changed the quantifies accordingly would answer the question. – Ole Tange May 01 '21 at 22:25
  • 6
    @Ole Tange, Asking us to write a regex parser is *way* too broad, and asking for library/tools recommendations is also off-topic. – ikegami May 02 '21 at 10:11
4

You can use a state machine:

#!/usr/bin/perl

use strict;
use warnings;

my @regexes = ( ".*", "[.*]", "[.*]{4,10}", "[.*{4,10}]{4,10}" );

for (@regexes) {
    print "give: $_\n";
    my $ungreedy = make_ungreedy($_,0);
    print "got:  $ungreedy\n";
    print "============================================\n"
}


sub make_ungreedy {
    my $regex = shift;

    my $class_state  = 0;
    my $escape_state = 0;
    my $found        = 0;
    my $ungreedy     = "";

    for (split (//, $regex)) {
        if ($found) {
            $ungreedy .= "?" unless (/\?/);
            $found = 0;
        }
        $ungreedy .= $_;

        $escape_state = 0, next if ($escape_state);
        $escape_state = 1, next if (/\\/);
        $class_state  = 1, next if (/\[/);
        if ($class_state) {
            $class_state = 0 if (/\]/);
            next;
        }
        $found = 1 if (/[*}+]/);
    }
    $ungreedy .= '?' if $found;
    return $ungreedy;
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Andy A.
  • 1,392
  • 5
  • 15
  • 28
  • 2
    You need to extend it to correctly handle non-capturing groups `(?:group)`, manually renumbered groups `(?|...)`, closing `}` which are not part of a quantifier - e.g. `hello}`, opening `[` as a part of a character class - e.g. `[a[b]`, but also possessive quantifiers, e.g. `X{2,3}+`, Unicode character classes `\p{IsDigit}`, etc. – Alex Shesterov May 07 '21 at 08:23
  • Yes, sure. I'll remark it. But to implementing it would last to long at the moment. Maybe better call another parser out of perl. Then there should be a `/U` flag. – Andy A. May 07 '21 at 16:45
  • 1
    I rolled back the handling of `?` so there shouldn't be problems with non capturing groups etc. – Andy A. May 07 '21 at 19:03