Get all possible matches for regex (in python)?

Question

I have a regex that can match a string in multiple overlapping ways. However, it seems to capture only one possible match in the string. How can I get all possible matches? I've tried finditer with no success, but maybe I'm using it wrong.

The string I'm trying to parse is:

foo-foobar-foobaz

The regex I'm using is:

(.*)-(.*)

>>> import re
>>> s = "foo-foobar-foobaz"
>>> matches = re.finditer(r'(.*)-(.*)', s)
>>> [match.group(1) for match in matches]
['foo-foobar']

I want the match (foo and foobar-foobaz), but it seems to only get (foo-foobar and foobaz).

See http://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches — Ray Toal, Sep 12 '11 at 06:07
@Ray Toal Thanks! I actually viewed that one earlier and upvoted it. — user449511, Sep 12 '11 at 06:10

Tim Pietzcker · Accepted Answer · 2011-09-12T06:18:45.297

No problem:

>>> import re
>>> regex = "([^-]*-)(?=([^-]*))"
>>> for result in re.finditer(regex, "foo-foobar-foobaz"):
...     print("".join(result.groups()))
...
foo-foobar
foobar-foobaz

By putting the second capturing parenthesis in a lookahead assertion, you can capture its contents without consuming it in the overall match.

I've also used [^-]* instead of .* because the dot also matches the separator - which you probably don't want.
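
To illustrate the difference between the two character classes, here is a small sketch (the variable names `greedy` and `bounded` are made up for this example):

```python
import re

s = "foo-foobar-foobaz"

# Greedy .* runs past the first separator and backtracks to the last one.
greedy = re.match(r"(.*)-", s).group(1)

# [^-]* cannot cross a separator, so it stops at the first one.
bounded = re.match(r"([^-]*)-", s).group(1)

print(greedy)   # foo-foobar
print(bounded)  # foo
```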

ikegami · Answer 2 · 2012-02-17T21:37:37.320

It's not something regex engines tend to be able to do. I don't know if Python can. Perl can, using the following, which records each candidate split and then forces the engine to backtrack with (*FAIL):

local our @matches;
"foo-foobar-foobaz" =~ /
    ^(.*)-(.*)\z
    (?{ push @matches, [ $1, $2 ] })
    (*FAIL)
/xs;

This specific problem can probably be solved with the regex engine in many languages, using the following technique:

my @matches;
while ("foo-foobar-foobaz" =~ /(?=-(.*)\z)/gsp) {
   push @matches, [ ${^PREMATCH}, $1 ];
}

(${^PREMATCH} refers to what comes before where the regex matched, and $1 refers to what the first () matched.)
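
A rough Python equivalent of the lookahead technique might look like the sketch below. Python's re module has no ${^PREMATCH}, so this assumes slicing the string up to match.start() is an acceptable substitute; re.DOTALL stands in for Perl's /s flag and \Z for \z.

```python
import re

s = "foo-foobar-foobaz"

# The lookahead is zero-width, so finditer visits every "-" position
# without consuming any text; the prefix is recovered by slicing.
matches = [
    (s[:m.start()], m.group(1))
    for m in re.finditer(r"(?=-(.*)\Z)", s, re.DOTALL)
]

print(matches)
# [('foo', 'foobar-foobaz'), ('foo-foobar', 'foobaz')]
```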

But you can easily solve this specific problem outside the regex engine:

my @parts = split(/-/, "foo-foobar-foobaz");
my @matches;
for (1..$#parts) {
   push @matches, [
      join('-', @parts[0..$_-1]),
      join('-', @parts[$_..$#parts]),
   ];
}

Sorry for using Perl syntax, but you should be able to get the idea. Translations to Python welcome.
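
One possible Python translation of the split-based approach, as a sketch (the function name `all_splits` is hypothetical, chosen only for this example):

```python
def all_splits(text, sep="-"):
    """Return every (left, right) pair obtained by cutting text at one separator."""
    parts = text.split(sep)
    return [
        (sep.join(parts[:i]), sep.join(parts[i:]))
        for i in range(1, len(parts))
    ]

print(all_splits("foo-foobar-foobaz"))
# [('foo', 'foobar-foobaz'), ('foo-foobar', 'foobaz')]
```

A string without any separator yields an empty list, since there is no place to cut.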

@user449511, Added another method that should be easy to implement in Python. — ikegami, Sep 12 '11 at 06:14

score 1 · Answer 3 · answered Sep 12 '11 at 06:05

If you want to detect overlapping matches, you'll have to implement it yourself. Essentially, for a string foo:

1. Find the first match that starts at string index i.
2. Run the matching function again against foo[i+1:].
3. Repeat steps 1 and 2 on the progressively shorter remaining portion of the string.

It gets trickier if you're using arbitrary-length capture groups (e.g. (.*)) because you probably don't want both foo-foobar and oo-foobar as matches, so you'd have to do some extra analysis to move i even farther than just +1 each match; you'd need to move it the entire length of the first captured group's value, plus one.
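
The basic loop could be sketched as follows. The helper name `overlapping_matches` is made up for illustration, and instead of slicing foo[i+1:] it passes a start position to the compiled pattern's search method, which behaves slightly differently from slicing for patterns that use ^ or lookbehind.

```python
import re

def overlapping_matches(pattern, text):
    """Collect matches by restarting the search one character after each match start."""
    regex = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        results.append(m.group(0))
        # Step forward by one from where this match began, not where it ended.
        pos = m.start() + 1
    return results

print(overlapping_matches(r"aba", "ababa"))
# ['aba', 'aba']
```

With a variable-length pattern such as [^-]*-[^-]* this naive +1 step also returns shortened duplicates of the same match, which is the problem described above.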

Amber

Not really. You can capture overlapping matches easily using lookahead assertions. – Tim Pietzcker Sep 12 '11 at 06:19
Oh, quite true. Hadn't occurred to me mostly because I was in the "not changing the original pattern" mindset, and thus just suggesting programmatic solutions. +1 for your answer here. :) – Amber Sep 12 '11 at 08:49
