How to: Overlapping match

Question

Let's say we have this:

A2 A1 B.         #1

A1 B.            #2

A3 A1 A8 B.      #3

How would I go about it if I want:

To match: A2 A1 B. and A1 B.
To match: A1 B.
To match: A3 A1 A8 B. and A1 A8 B. and A8 B.

So far I've got this regex:

A\d\s(.*\.)

But it won't match parts of the text that have already been matched (I'm matching using re.finditer). My guess is that re.finditer is doing just what it's supposed to, and I'm just trying to force it into doing stupid stuff.


One question - why do you expect `B.` to match `A\d\s(.*\.)`? — Tadhg McDonald-Jensen, Nov 21 '16 at 15:47
Possible duplicate of [How to find overlapping matches with a regexp?](http://stackoverflow.com/questions/11430863/how-to-find-overlapping-matches-with-a-regexp) — Florian Brucker, Nov 21 '16 at 15:52
@PatrickHaugh not necessarily. I just need to be able to distinguish `A2 A1 B.` from `A1 B.` etc. I need to perform different operation on them in the code after the regex is done. — Olian04, Nov 21 '16 at 15:52
do you need regex? could you just make a list of substrings with split and join? — depperm, Nov 21 '16 at 15:54
@depperm the pattern in the actual code is a bit more complicated than the example. So regex would be preferable. — Olian04, Nov 21 '16 at 15:55

score 2 · Accepted Answer · answered Nov 21 '16 at 15:53

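You can use a lookahead for this and capture the value inside it. A lookahead consumes no characters, so the engine tries again at every position and the captured matches can overlap: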

regex = r"(?=((?:A\d+\s+)+B\.))"


RegEx Description:

(?=               # start lookahead
   (              # start capturing group #1
      (?:         # start non-capturing group
         A\d+\s+  # match A followed by 1 or more digits followed by 1 or more whitespace characters
      )+          # end non-capturing group; repeat it 1 or more times
      B\.         # match B and a literal dot
   )              # end capture group #1
)                 # end lookahead
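
The annotated pattern can also be compiled as written by passing re.VERBOSE, which makes the engine ignore whitespace and `#` comments inside the pattern. This is only a sketch: the name `OVERLAP_RE` is made up for illustration and is not part of the original answer.

```python
import re

# Same pattern as the one-line version, laid out with re.VERBOSE so that
# whitespace and comments inside the pattern are ignored by the engine.
# OVERLAP_RE is a hypothetical name chosen for this example.
OVERLAP_RE = re.compile(r"""
    (?=                 # lookahead: consumes no characters
        (               # capturing group 1
            (?:         # non-capturing group
                A\d+\s+ # A, one or more digits, one or more whitespace
            )+          # repeat the group one or more times
            B\.         # B followed by a literal dot
        )
    )
""", re.VERBOSE)

matches = OVERLAP_RE.findall("A3 A1 A8 B.")
print(matches)
```

Because the pattern contains exactly one capturing group, findall returns the text of that group for each position where the lookahead succeeds.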

Code:

>>> import re
>>> regex = r"(?=((?:A\d+\s+)+B\.))"

>>> print(re.findall(regex, 'A2 A1 B.'))
['A2 A1 B.', 'A1 B.']

>>> print(re.findall(regex, 'A1 B.'))
['A1 B.']

>>> print(re.findall(regex, 'A3 A1 A8 B.'))
['A3 A1 A8 B.', 'A1 A8 B.', 'A8 B.']
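
Since the question uses re.finditer, here is a minimal sketch of the same pattern with finditer. The overall match of a lookahead is empty, so the text and its position have to be read from group 1. The helper name `overlapping_matches` is an assumption made for this example.

```python
import re

regex = re.compile(r"(?=((?:A\d+\s+)+B\.))")

def overlapping_matches(text):
    """Return (start_index, matched_text) for every overlapping match.

    Hypothetical helper: m.group(0) is always empty here because the
    whole pattern is a lookahead, so group 1 is used instead.
    """
    return [(m.start(1), m.group(1)) for m in regex.finditer(text)]

for start, found in overlapping_matches("A3 A1 A8 B."):
    print(start, found)
```

Having the start index alongside each match makes it possible to tell `A2 A1 B.` apart from the `A1 B.` nested inside it when different operations are needed for each.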
