7

How can I obtain the number of overlapping regex matches using Python?

I've read and tried the suggestions from this, that and a few other questions, but found none that would work for my scenario. Here it is:

  • input example string: akka
  • search pattern: a.*k

A proper function should yield 2 as the number of matches, since there are two possible end positions (k letters).

The pattern might also be more complicated, for example a.*k.*a should also be matched twice in akka (since there are two k's in the middle).

Community
  • 1
  • 1
jankes
  • 207
  • 1
  • 7
  • I think you mean **non**-overlapping regex matches. – jcollado Feb 17 '12 at 08:10
  • @jcollado I think I mean overlapping. `b.*o` should be found 4 times in `bboo`. – jankes Feb 17 '12 at 08:33
  • What's wrong with [this answer](http://stackoverflow.com/a/1374893/180174)? – Kimvais Feb 17 '12 at 08:41
  • @Kimvais It returns `1` for both `a.*k` and `a.*k.*a`. – jankes Feb 17 '12 at 08:46
  • 1
    This is pretty easy to do in Perl using just a single pattern match, but I can’t figure out how to make Python behave. For example: `() = "bbboobbooo" =~ /(b.*o)(?{push @all, $1})(*FAIL)/g; printf "got %d matches: %s\n", scalar(@all), "@all";` in Perl will print out `got 21 matches: bbboobbooo bbboobboo bbboobbo bbboo bbbo bboobbooo bboobboo bboobbo bboo bbo boobbooo boobboo boobbo boo bo bbooo bboo bbo booo boo bo`. Is that what you’re looking for? – tchrist Feb 17 '12 at 15:31
  • 1
    @tchrist Wow... If it is that much simpler to do in perl, I'll use that launguage instead. (I hardly know any of those anyway...) – jankes Feb 18 '12 at 15:46
  • @jankes: Note that this is not exactly "pattern matching"; in fact, pattern matching in Python is by default greedy; your regexps have exactly one match, in your examples, because of this greediness. Overlapping matches would be what you have if you try to match `aab` with `a.*b`, for instance (both `aab` and `ab` match, but they overlap in the original string). What you want is more of a syntax parsing; libraries like LEPL (see jcollado's answer) or funcparserlib handle this kind of job. – Eric O. Lebigot Feb 19 '12 at 01:36

2 Answers2

3

I think that what you're looking for is probably better done with a parsing library like lepl:

>>> from lepl import *
>>> parser = Literal('a') + Any()[:] + Literal('k')
>>> parser.config.no_full_first_match()
>>> list(parser.parse_all('akka'))
[['akk'], ['ak']]
>>> parser = Literal('a') + Any()[:] + Literal('k') + Any()[:] + Literal('a')
>>> list(parser.parse_all('akka'))
[['akka'], ['akka']]

I believe that the length of the output from parser.parse_all is what you're looking for.

Note that you need to use parser.config.no_full_first_match() to avoid errors if your pattern doesn't match the whole string.

Edit: Based on the comment from @Shamanu4, I see you want matching results starting from any position, you can do that as follows:

>>> text = 'bboo'
>>> parser = Literal('b') + Any()[:] + Literal('o')
>>> parser.config.no_full_first_match()
>>> substrings = [text[i:] for i in range(len(text))]
>>> matches = [list(parser.parse_all(substring)) for substring in substrings]
>>> matches = filter(None, matches) # Remove empty matches
>>> matches = list(itertools.chain.from_iterable(matches)) # Flatten results
>>> matches = list(itertools.chain.from_iterable(matches)) # Flatten results (again)
>>> matches
['bboo', 'bbo', 'boo', 'bo']
jcollado
  • 39,419
  • 8
  • 102
  • 133
  • Thanks. Looks well with `a.*k` and `a.*k.*a` against `akka`, but fails with `b.*o` against `bboo`. – jankes Feb 17 '12 at 09:20
  • @jankes In that case with `parser = Literal('b') + Any()[:] + Literal('o')` I get: `[['bboo'], ['bbo']]`. What would be the right answer? – jcollado Feb 17 '12 at 09:55
  • 2
    @jcollado the right answer should be `[['bbo'],['bboo'],['boo'],['bo']]` – Shamanu4 Feb 17 '12 at 10:39
  • @Shamanu4 Thanks for your comment. I've updated the answer with the information you provided and now it works for the example. – jcollado Feb 17 '12 at 11:26
  • Well, Python regexes aren’t really set up for this sort of power-user use of regexes. If so, a single match should be able to find all 21 ways that `b.*o` can match in `"bbboobbooo"`. More powerful languages like Perl can do this in a single statement, but I guess since Python doesn’t integrate regexes into the language, nor allow proper backtracking control, it’s back to the drawing board of writing yet another library. Ug. – tchrist Feb 17 '12 at 15:37
2

Yes, it is ugly and unoptimized but it seems to be working. This is a simple try of all possible but unique variants

def myregex(pattern,text,dir=0):
    import re
    m = re.search(pattern, text)
    if m:
        yield m.group(0)
        if len(m.group('suffix')):
            for r in myregex(pattern, "%s%s%s" % (m.group('prefix'),m.group('suffix')[1:],m.group('end')),1):
                yield r
            if dir<1 :
                for r in myregex(pattern, "%s%s%s" % (m.group('prefix'),m.group('suffix')[:-1],m.group('end')),-1):
                    yield r


def myprocess(pattern, text):    
    parts = pattern.split("*")    
    for i in range(0, len(parts)-1 ):
        res=""
        for j in range(0, len(parts) ):
            if j==0:
                res+="(?P<prefix>"
            if j==i:
                res+=")(?P<suffix>"
            res+=parts[j]
            if j==i+1:
                res+=")(?P<end>"
            if j<len(parts)-1:
                if j==i:
                    res+=".*"
                else:
                    res+=".*?"
            else:
                res+=")"
        for r in myregex(res,text):
            yield r

def mycount(pattern, text):
    return set(myprocess(pattern, text))

test:

>>> mycount('a*b*c','abc')
set(['abc'])
>>> mycount('a*k','akka')
set(['akk', 'ak'])
>>> mycount('b*o','bboo')
set(['bbo', 'bboo', 'bo', 'boo'])
>>> mycount('b*o','bb123oo')
set(['b123o', 'bb123oo', 'bb123o', 'b123oo'])
>>> mycount('b*o','ffbfbfffofoff')
set(['bfbfffofo', 'bfbfffo', 'bfffofo', 'bfffo'])
Shamanu4
  • 5,296
  • 2
  • 27
  • 31
  • Thank you very much for your time on this... However, there seems to be some indentation error(s?) in the definition of `myprocess` - I'm not sure how to fix this... – jankes Feb 17 '12 at 12:27
  • @jankes Sorry, some indentation was corrupted after using `code` tag. Now it is correct. – Shamanu4 Feb 17 '12 at 13:38
  • Thanks for fixing :) It is now much better - extra letters in the middle or after the matching part have no effect. But it still fails with extra characters at the beginning. – jankes Feb 17 '12 at 13:49
  • Does it honestly take so *incredibly much code* in Python? Really? Why? In Perl or PCRE, it’s just a one-liner. For example, `"bbboobbooo"` should be matched by `b.*o` in 21 possible ways. I think you don’t have good enough backtracking control. – tchrist Feb 17 '12 at 15:34
  • 1
    I think it does the task really well now. But I've realized two things: 1) judging from @tchrist's comments, Python is not the best language for that task; 2) I actually need all - not only the unique matches. – jankes Feb 18 '12 at 17:23
  • @jankes My version will actually include duplicates by default; this can happen with multiple quantifiers in different places. I just didn’t have but one `*` in the example. You would have to trim the dups yourself in post-processing, or include a check in the `push`, like `push @all, $1 unless $seen{$1}++` where both the array (python list) and the hash (python dictionary) were cleared before the regex solver ran. – tchrist Feb 18 '12 at 17:44
  • Ok - as I've already written, I'm amazed by the simplicity of your Perl solution. I've decided to ask a separate question for that - please have a look [here](http://stackoverflow.com/questions/9343986/count-overlapping-regex-matches-in-perl-or-ruby). – jankes Feb 18 '12 at 18:38