4

I want to find the position in a string where a regular expression stops matching.

Simple example:

my $x = 'abcdefghijklmnopqrstuvwxyz';
$x =~ /gho/;

This example should give me the position of the character 'h', because 'h' matches and 'o' is the first character of the pattern that does not match.

I thought of using pos or @-, but neither is set on an unsuccessful match. Another solution would be to iteratively shorten the regex pattern until it matches, but that is very ugly and doesn't work on complex patterns.

EDIT:

Okay for the linguists: I'm sorry for my awful explanation.

To clarify my situation: if you think of a regular expression as a finite automaton, there is a point where matching is interrupted because a character doesn't fit. This point is what I'm searching for.

Using nested parentheses (as mentioned by eugene y) is a nice idea, but it doesn't work with quantifiers and I would have to edit the pattern.

Are there other ideas?

Hachi
  • 3
    Shortening the regex only works with very simple patterns. Think about a pattern like `gh[^a-c]` or `g(?=h(?=j))hj` – Bart Kiers Oct 10 '11 at 12:03
  • 3
    Where would your regex "stop matching" against the string "ghXgXg"? – pilcrow Oct 10 '11 at 12:10
  • 2
    One may argue the pattern *doesn't match* on **every position of the string**. So, you want to refine the question: when does *some* of pattern match the string, but not *all* of it. Assuming this a simplified example (or you can easily loop over characters), you need to figure out what is "some". Then, maybe you can match for "some" as a first step? For example, in this case, it looks like you may be looking for `g(?!ho)`, but this also doesn't answer your full question. – Kobi Oct 10 '11 at 12:55
  • 1
    The `perlre` manual page vaguely hints at possibilities to hook into the regular expression engine. For a start, investigate the output of `perl -Mre=debug` on (a very pared-down version of) your script. In this case, though, it will simply do a literal search for "gho" over the whole string, and fail. – tripleee Oct 10 '11 at 15:37
  • 1
    You don’t want the pattern to match. You want only a part of it to match. That is much different. – tchrist Oct 10 '11 at 16:55

5 Answers

4

You can get the matching part, and use the index function to find its position:

my $x = 'abcdefghijklmnopqrstuvwxyz';

$x =~ /(g(h(o)?)?)/;
print index($x, $1) + length($1), "\n"; #8
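
A possible variation on the snippet above, offered only as a sketch: `$+[0]` already holds the offset just past the whole match, so the `index` lookup and the capture groups are not strictly needed. The variable name `$stop` is made up for illustration, and the pattern is still the hand-nested version of `gho`.

```perl
use strict;
use warnings;

my $x = 'abcdefghijklmnopqrstuvwxyz';

# Same nested-optional idea as above, with non-capturing groups.
# $+[0] is the offset just past the end of the last successful match.
my $stop;
if ($x =~ /g(?:h(?:o)?)?/) {
    $stop = $+[0];
}

print defined $stop ? "$stop\n" : "no partial match\n";
```

Reading the offset from `@+` also avoids a second search through the string, which could in principle land on a different occurrence of the same text when the pattern uses anchors or lookarounds.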
Eugene Yarmash
4

What you are proposing is difficult but doable.

If I can paraphrase what I understand, you want to find out how far a failing match got into the string. In order to do this, you need to be able to parse a regex.

The best regex parser is probably Perl itself, run with the -Mre=debug command line switch:

$ perl -Mre=debug -e'"abcdefghijklmnopqr"=~/gh[ijkl]{5}/'
Compiling REx "gh[ijkl]{5}"
Final program:
   1: EXACT <gh> (3)
   3: CURLY {5,5} (16)
   5:   ANYOF[i-l][] (0)
  16: END (0)
anchored "gh" at 0 (checking anchored) minlen 7 
Guessing start of match in sv for REx "gh[ijkl]{5}" against "abcdefghijklmnopqr"
Found anchored substr "gh" at offset 6...
Starting position does not contradict /^/m...
Guessed: match at offset 6
Matching REx "gh[ijkl]{5}" against "ghijklmnopqr"
   6 <bcdef> <ghijklmnop>    |  1:EXACT <gh>(3)
   8 <defgh> <ijklmnopqr>    |  3:CURLY {5,5}(16)
                                  ANYOF[i-l][] can match 4 times out of 5...
                                  failed...
Match failed
Freeing REx: "gh[ijkl]{5}"

You can shell out to that Perl command line with your regex and parse its output (the debug trace is written to STDERR, not STDOUT). Look for the `

Here is a matching regex:

$ perl -Mre=debug -e'"abcdefghijklmnopqr"=~/gh[ijkl]{3}/'
Compiling REx "gh[ijkl]{3}"
Final program:
   1: EXACT <gh> (3)
   3: CURLY {3,3} (16)
   5:   ANYOF[i-l][] (0)
  16: END (0)
anchored "gh" at 0 (checking anchored) minlen 5 
Guessing start of match in sv for REx "gh[ijkl]{3}" against "abcdefghijklmnopqr"
Found anchored substr "gh" at offset 6...
Starting position does not contradict /^/m...
Guessed: match at offset 6
Matching REx "gh[ijkl]{3}" against "ghijklmnopqr"
   6 <bcdef> <ghijklmnop>    |  1:EXACT <gh>(3)
   8 <defgh> <ijklmnopqr>    |  3:CURLY {3,3}(16)
                                  ANYOF[i-l][] can match 3 times out of 3...
  11 <ghijk> <lmnopqr>       | 16:  END(0)
Match successful!
Freeing REx: "gh[ijkl]{3}"

You will need to build a parser that can handle the output of the Perl re debugger. The two pairs of angle brackets on each trace line show the text just before and just after the current position, and the number in front of them is the offset into the string that the regex engine has reached.

This is not an easy project btw...
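
As a rough starting point for such a parser, here is a minimal sketch. The helper names `farthest_offset` and `debug_trace` are invented for this example, and the line format it looks for is assumed from the traces shown above; the debug output is not a stable interface and differs between Perl versions.

```perl
use strict;
use warnings;
use IPC::Open3;

# Return the largest string offset seen on the trace lines of an re=debug
# dump, i.e. lines that look like "   8 <defgh> <ijklmnopqr>    | ...".
sub farthest_offset {
    my ($trace) = @_;
    my $far;
    for my $line (split /\n/, $trace) {
        next unless $line =~ /^\s*(\d+)\s+</;
        $far = $1 if !defined $far || $1 > $far;
    }
    return $far;    # undef when no trace line was found
}

# Run a child perl with -Mre=debug. Passing undef as the third argument
# makes open3 deliver the child's STDOUT and STDERR on the same handle.
sub debug_trace {
    my ($string, $pattern) = @_;
    my $pid = open3(my $in, my $out, undef,
                    $^X, '-Mre=debug', '-e', '$ARGV[0] =~ /$ARGV[1]/',
                    $string, $pattern);
    close $in;
    my $trace = do { local $/; <$out> };
    waitpid $pid, 0;
    return defined $trace ? $trace : '';
}

my $far = farthest_offset(debug_trace('abcdefghijklmnopqr', 'gh[ijkl]{5}'));
print defined $far ? "engine reached offset $far\n" : "no trace lines found\n";
```

This only reports where the last traced node started. It ignores details such as "can match 4 times out of 5", and it finds nothing at all when the optimizer rejects the string before the engine starts matching.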

dawg
  • Great! Does it work for any arbitrary regex or a limited subset? The grammar of the debug output is not that simple. – dawg Oct 13 '11 at 12:47
1

This seems to work. Basically the idea is to split the regex into its constituent parts and try them sequentially, returning the last matching position. The fixed strings need to be split up, but the character classes and quantifiers can be kept together.

In theory this should work, but it may need tweaking.

use v5.10;
use strict;
use warnings;

my $string = 'abcdefghijklmnopqrstuvwxyz';
my $match  = partial_match($string, qw(g h (?=i) [ijkx]+ [lmn]+ z));
say "match ended at pos $match, character ", substr($string,$match,1);

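sub partial_match {
    my $string = shift;
    my @rx = @_;
    # Scalar-context //g match so that pos() is set on success.
    if ($string =~ /$rx[0]/g) {
        my $pos = pos $string;
        return $pos unless defined $rx[1];
        # Merge the next piece into the pattern and retry; keep the
        # current position if the longer pattern fails.
        splice @rx, 0, 2, $rx[0] . $rx[1];
        return partial_match($string, @rx) // $pos;
    }
    say "Didn't match $rx[0]";
    return;
}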
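
The same idea can also be written as a loop instead of a recursion. The following is only a sketch under the same assumption as above, namely that the pattern has already been split into pieces by hand; the name `partial_match_iter` is made up for this example.

```perl
use strict;
use warnings;

# Iterative variant: grow the pattern one piece at a time and remember the
# end offset ($+[0]) of the longest prefix that still matches.
sub partial_match_iter {
    my ($string, @pieces) = @_;
    my ($prefix, $end) = ('', undef);
    for my $piece (@pieces) {
        $prefix .= $piece;
        last unless $string =~ /$prefix/;
        $end = $+[0];
    }
    return $end;    # undef if not even the first piece matched
}

my $end = partial_match_iter('abcdefghijklmnopqrstuvwxyz',
                             qw(g h (?=i) [ijkx]+ [lmn]+ z));
print "longest prefix ends at offset $end\n";    # prints offset 14
```

Like the recursive version, this retries each longer prefix from the start of the string, so a longer prefix may match at a different position than a shorter one did.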
TLP
0

How about:

#!/usr/bin/perl 
use Modern::Perl;

my $x = 'abcdefghijklmnopqrstuvwxyz';
my $s = 'gho';
do {
    if ($x =~ /$s/) {
        say "$s matches from $-[0] to $+[0]";
    } else {
        say "$s doesn't match";
    }
} while chop $s;

output:

gho doesn't match
gh matches from 6 to 8
g matches from 6 to 7
 matches from 0 to 0
Toto
0

I think that's exactly what the pos function is for. Note: pos only works if you use the /g flag.

my $x = 'abcdefghijklmnopqrstuvwxyz';
my $end = 0;
if( $x =~ /$ARGV[0]/g )
{
    $end = pos($x);
}
print "End of match is: $end\n";

Gives the following output

[@centos5 ~]$ perl x.pl
End of match is: 0
[@centos5 ~]$ perl x.pl def
End of match is: 6
[@centos5 ~]$ perl x.pl xyz
End of match is: 26
[@centos5 ~]$ perl x.pl aaa
End of match is: 0
[@centos5 ~]$ perl x.pl ghi
End of match is: 9
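
A small sketch of why `pos` does not help with the original question: a failed `//g` match resets the position, so nothing is left to read afterwards. The variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my $x = 'abcdefghijklmnopqrstuvwxyz';

# A successful scalar-context //g match sets pos() to the end of the match.
my $ok = $x =~ /gh/g;
my $after_success = pos($x);

# A failed //g match (without the /c flag) resets pos() to undef.
my $failed = $x =~ /gho/g;
my $after_failure = pos($x);

printf "after success: %s, after failure: %s\n",
    $after_success, defined $after_failure ? $after_failure : 'undef';
```

With the `/c` flag the old position would be kept after the failure, but it would still describe the earlier successful match rather than how far the failing pattern got.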
Sodved
  • Sorry, I misread the question. The actual question is very tricky, especially if the regex is more complicated than just `/gho/`, especially if it contains `[` or `(`. Should I delete my irrelevant answer? – Sodved Oct 10 '11 at 15:27
  • I liked the possibility to see an example of how `pos` works, as I didn't know about it before - so now I can understand why it also doesn't apply to the question; so thanks for this answer! `:)` – sdaau Jun 08 '12 at 18:26