Determining the parts that match a regex in perl

Question

I'm looking for the accumulation of possibly overlapping matches of a regex (the final goal being to do further searches in the resulting substrings).

I want to skip the matches that have already been "accumulated", while avoiding to make copies with substr (I might be wrong about avoiding substr), but the condition that I wrote for it with pos($...) = ... and a next if $... =~ /.../ doesn't work:

#!/usr/bin/env perl

# user inputs
$regexp = "abc|cba|b";
$string = "_abcbabc_bacba";

$length = length($string);
$result = "0" x $length;

while ( pos($string) < $length and $string =~ /$regexp/go ) {
    pos($string) = $-[0] + 1;
    next unless ($len = $+[0] - $-[0]);

#   The failing condition is here:
#    pos($result) = $-[0];
#    next if $result =~ /1{$len}/;

    substr($result, $-[0], $len) = "1" x $len;

    printf "%s\n", $string;
    printf "%".$-[0]."s%s\n", "", "^" x $len;
}
printf "%s\n", $result;

By commenting those lines I can get the desired result which is 01111111010111:

_abcbabc_bacba
 ^^^
_abcbabc_bacba
  ^
_abcbabc_bacba
   ^^^
_abcbabc_bacba
    ^
_abcbabc_bacba
     ^^^
_abcbabc_bacba
      ^
_abcbabc_bacba
         ^
_abcbabc_bacba
           ^^^
_abcbabc_bacba
            ^
01111111010111

But my expected output (with a working condition) would be:

_abcbabc_bacba
 ^^^
_abcbabc_bacba
   ^^^
_abcbabc_bacba
     ^^^
_abcbabc_bacba
         ^
_abcbabc_bacba
           ^^^
01111111010111

notes:

for each iteration I print the original string; the ^ just below show the characters that have been matched in the current iteration.
the 0 & 1 at the end represent the overall result. The characters that have been matched at least once during the process are set to 1.
My commented condition is meant to skip the current match when its corresponding characters are already set to 1 in the result.

So, you are trying to find only the longest overlapping sub match? — brian d foy, Aug 12 '22 at 16:39
Can you add a clear statement of what _precisely_ you want? I see this: find the (possibly overlapping) substrings in the order specified in the alternation, and the earlier ones take precedence. So in `abcbab` with the pattern `abc|cba|b` we find: `abc` and then `cba` (and not `b`), but in `abccba` we'd find `abc` and then `b` (the second char) because it comes before the `cba` in the string (even though it's after it in alternation). Is this correct? — zdim, Aug 12 '22 at 18:27
That would be all the cumulative matches; for the example above the result is (in terms of character positions) `1-7` `9-9` `10-12` — Fravadona, Aug 12 '22 at 18:27
@zdim What I'm looking for is to make my commented lines work so that the output would be the expected one. I'm just starting this small project and it's been years that I didn't write anything in `perl` , so I'll probably modify the code over and over but here I don't understand what's wrong. — Fravadona, Aug 12 '22 at 18:35
What's wrong is that the output depends on things that aren't know yet. You don't know if you should output things until you've exhausted all the possible future matches that might overlap. — brian d foy, Aug 12 '22 at 18:42
@briandfoy I added a few notes in an attempt to explain what my ouput mean and the problem is — Fravadona, Aug 12 '22 at 19:18
This task doesn't really make sense. If you were a manager with nothing but some business requirements, how would you explain it to your developer? There is almost certainly a simpler way to approach this. — lordadmira, Aug 14 '22 at 21:29

score 2 · Answer 1 · answered Aug 12 '22 at 16:47

I think you really want to find the longest overlapping sub match. If you can guarantee that the alternation will have the substrings in the order that you prefer them, that approach might work, but it also requires to know a lot about what is happening besides the match, and in future matches. That is, you don't know if you can output anything until you've the future matches that might overlap, and you can't tell how far into the future you need to look.

You can mess around with pos, but I think I'd just match each substring separately, remember the starting positions, then compare later. Decompose the problem into separate tasks for finding the matching positions and for deciding which ones you want.

Even if I wrote the same code you presented, it's unlikely that I'd remember everything that must happen just right to make it all work out if I had to see it again after a long absence (even if I did highlight @- and @+ in the first chapter of Mastering Perl ;)

use v5.10;
use strict;

my $target      = "_abcbabc_bacba";
my @looking_for = qw( abc cba b );

my @found;

foreach my $want ( @looking_for ) {
    my $pos = 0;
    while( my $found_at = index $target, $want, $pos ) {
        last if $found_at == -1;
        push @found, $found_at;
        $pos = $found_at + 1;
        }
    }

my @found  = sort { $a->[1] <=> $b->[1] } @found;

use Data::Dumper;
say Dumper( \@found );

Now you have a data structure that you can massage any way that you like instead of thinking about all this stuff while in regex land. How you decide to do that is left as an exercise for the reader.

Part of this may be inline. You can build up this data structure to the point where you know that everything you have so far can produce output (i.e. the thing you just matched does not overlap with the prior thing).

Thank you for your insights; the regex will be an user input so it's hard to split it apart. — Fravadona, Aug 12 '22 at 18:28
That's right, but decomposing an arbitrary regex doesn't seem feasible without building an AST. — Fravadona, Aug 12 '22 at 18:53

Determining the parts that match a regex in perl

1 Answers1

Linked