Perl - Extract all regular expression match

Question

$regex   = "[M]?[VI]?[A]?[R]?[G]?[D]?[LM]?[G]?[IVMAL]?[E]?";
$text = "VMVARGDLGVE";

if (@matched = $text =~ /$regex/g) {
    $no_of_match = scalar @matched;
    print "No of match: ", $no_of_match, "\n";
    foreach my $i (@matched) {
        print $i, "\n";
    }
}

This program is generating output as follows: No of match: 3 VMV ARGDLGVE

I was expecting this program to output something like: V VM VMV MV MVA MVAR MVARG MVARGD MVARGDL MVARGDLG MVARGDLGV MVARGDLGVE ....

I was trying to get all possible matches as output. How to get all possible matches ?

First of all `use strict` and `use warnings`. Then, you have a lot of unnecessary character classes. And consider using `while` instead of `if`. — Patrick J. S., Jan 02 '15 at 22:27
Oh yeah sorry, you'll have to use another regex as well. I'll look into it. — Patrick J. S., Jan 02 '15 at 22:40

score 1 · Answer 1 · answered Jan 02 '15 at 23:21

Regex matching doesn't work like that. If you search again after it finds a match, it starts searching again from the end of that match. So it doesn't find overlapping matches.

I'm assuming that you're working in Perl 5. I believe there's some way of doing it in Perl 6. Which is really a different language. I don't know of any other language in which it's possible to find overlapping regex matches like you want.

The first match it finds is the substring VMV at the start. Then it starts searching from where it left off, and it finds the match ARGDLGVE. Then it tries again from where it left off, which by this stage is at the end of the string. So it finds the empty substring at the end as a match. (Notice that your regex matches the empty string.) Regexes are forbidden to find the same empty string again, since that would lead to an infinite loop, so it stops searching.

What's the point of this code? Because I really can't see a nice way of doing this, and I hope there's some other way of achieving your goal. You could go through all possible substrings of $text and check each of them individually, using /^$regex$/. You'll find 70 matches, by my count (including duplicates and empty strings). Maybe you want a more restrictive regex?

DavidO · Answer 2 · 2015-01-03T19:15:53.267

I've refactored your code to make it easier to visualize what's going on.

my $regex = qr/
  M?        # 1:  Match M,                  greedy, 0 or 1 times.
  [VI]?     # 2:  Match V or I,             greedy, 0 or 1 times.
  A?        # 3:  Match A,                  greedy, 0 or 1 times.
  R?        # 4:  Match R,                  greedy, 0 or 1 times.
  G?        # 5:  Match G,                  greedy, 0 or 1 times.
  D?        # 6:  Match D,                  greedy, 0 or 1 times.
  [LM]?     # 7:  Match L or M,             greedy, 0 or 1 times.
  G?        # 8:  Match G,                  greedy, 0 or 1 times.
  [IVMAL]?  # 9:  Match I, B, M, A or L,    greedy, 0 or 1 times.
  E?        # 10: Match E,                  greedy, 0 or 1 times.
/x;
my $text = "VMVARGDLGVE";

if ( my @matched = $text =~ /$regex/gx ) {
  print "No of matches: ", scalar(@matched), "\n";
  print "<<$_>>\n" foreach @matched;
}

Your regular expression is matching in exactly as many ways as it possibly can without breaking the rules of its NFA regexp engine.

One of those rules is "leftmost"; the substring closest to the left in the target that allows the total match to succeed will be chosen.
Another rule is that greedy quantifiers will match as much as possible, and will only give up the submatch they own if it becomes necessary to allow the full match to succeed. Giving up part of their match involves backtracking. Backtracking is avoided unless a quantifier has taken too much, and must relinquish it to allow the full match to succeed.
And another rule is that iterative matching resumes at the point that the previous match left off.

Walking through your target string, "V" matches the subpattern [VI]? (hereafter referred to as subpattern #2). The greedy quantifier holds onto that 'V', and will only release it if it is forced to later on, for the greater good.

"M" from the target string matches subpattern #7. And "V" matches subpattern #9. First iteration is done, having matched "VMV".

Now the remainder of your target string looks like "ARGDLGVE". The pos marker is at 3 (the 4th character in the target string), so matching on the second iteration begins there. 'A' matches at subpattern #3, 'R' matches at #4, 'G' matches at #5, 'D' matches at #6, 'L' matches at #7, 'G' matches at #8, 'V' matches at #9, and 'E' matches at #10. The second iteration is done, having matched 'ARGDLGVE' from the target string.

On the third iteration, the pos marker is at 11, which is after the last character in the target string. So the empty string is compared against your regular expression. Because every quantifier in your regexp is "0 or 1", it is acceptable for the regular expression to match an empty string. So the third iteration is done, having matched "" (empty string).

You got three matches: "VMV", "ARGDLGVE", and "".

One thing you may wish to do is to take control of the pos marker. Put your regular expression in a while loop, and before the loop terminates, advance pos one more position from the start of the string. But that only solves the problem you are having with the third rule mentioned above. You still will have the problem that quantifiers do very specific things, and don't violate their own rules just because you think it would be convenient.

The point is that the regular expression engine is not a permutation engine. Its job is to determine if a given target string matches a given pattern, following a set of well-defined (though sometimes confusing) rules.

I'm not sure what the bigger picture problem is that you're trying to solve. If you're simply trying to expand a set of ranges, you might have better success with the CPAN module String::Range::Expand. There are probably other CPAN modules that could do range expansion for you too, but this one could be a good starting point.

score 1 · Accepted Answer · edited May 23 '17 at 10:26

1

This is almost the same question as Count overlapping regex matches in Perl OR Ruby.

This code is nearly unchanged from perldoc perlre, under the section titled "Special Backtracking Control Verbs":

use strict;
use warnings;

my $regex = qr/M?[VI]?A?R?G?D?[LM]?G?[IVMAL]?E?/;
my $text  = 'VMVARGDLGVE';

my $count = 0;
$text =~ /$regex(?{print "$&\n"; $count++})(*FAIL)/g;
print "Got $count matches\n";

The script does count empty string matches to come up with a count of 97 matches.

edited May 23 '17 at 10:26

Community

1
1

answered Jan 05 '15 at 01:12

thisoldman

26
1
2

Excellent use of a backtracking control verb. – DavidO Jan 05 '15 at 15:56
It's interesting that you found 97 matches and I only found 70. We're counting slightly different things: if a substring matches in two different ways, does it count twice? It's not clear which answer OP wanted. If you declare `my %matches;` and add `$matches{(pos $text).','.$^N} = 1;` inside the `(?{ })`, then `scalar keys %matches` will show the number of matching substrings, and it's 70. Or maybe they didn't even want to count duplicate substrings separately (eg only count one `''`): for this, use `$matches{$^N}=1`, and you get `scalar keys %matches` = 56. – David Knipe Jan 05 '15 at 21:11

Perl - Extract all regular expression match

3 Answers3