Get all matches including overlapping

Question

I want to parse the text in such a way that the brackets with a digit are added to the substring before and after. As far as I understand, a regex normally consumes the string as it matches, which means that by default matches can't overlap, right? How do I have to adapt pattern_3 to get the desired output?

import re

text = 'a(1)a(2)a(1)a'
pattern = r'(a(?:\((\d+)\))?)'
re.findall(pattern, text)
# [('a(1)', '1'), ('a(2)', '2'), ('a(1)', '1'), ('a', '')]


pattern_2 = r'((?:\((\d+)\))?a(?:\((\d+)\))?)'
re.findall(pattern_2, text)
# [('a(1)', '', '1'), ('a(2)', '', '2'), ('a(1)', '', '1'), ('a', '', '')]


pattern_3 = r'((?:\((\d+)\))?a(?=(?:\((\d+)\)))?)'
re.findall(pattern_3, text)
# [('a', '', '1'), ('(1)a', '1', '2'), ('(2)a', '2', '1'), ('(1)a', '1', '')]


# desired output:
# [('a(1)', '', '1'), ('(1)a(2)', '1', '2'), ('(2)a(1)', '2', '1'), ('(1)a', '1', '')]

Update

Looking for a solution using re only

Do you want to use `re` or are you open to using `regex` module instead? — Paolo, Aug 05 '18 at 18:50
maybe this one helps: [`(?:^|(?=\())(?=((?:\((\d+)\))?a?(?:\((\d+)\))?))`](https://regex101.com/r/qPhqaM/2) — bobble bubble, Aug 05 '18 at 19:52
@bobblebubble that works! If you post it as an answer I will accept it. Could you give me some hints on how it works? Especially the first part of the regex looks quite strange to me :) – RandomDude, Aug 05 '18 at 21:58

Michał Turczyn · Answer 1 · 2018-08-05T20:53:55.633

1

You could try this pattern: (?=(\(\d+?\)[a-z]\(\d+?\)|[a-z]\(\d+?\)|\(\d+?\)[a-z])), which solves the problem using a positive lookahead.

Since look-arounds are assertions, they match but don't consume the string, so it's enough to put a capturing group inside them. You can then match the same part of the string multiple times and access those matches through the capturing group.

In my solution, there is always exactly one capturing group.
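
As a quick check, here is a minimal sketch that runs this pattern against the `text` from the question. The variable name `matches` is only illustrative. Because there is a single capturing group, `re.findall` returns plain strings rather than tuples.

```python
import re

text = 'a(1)a(2)a(1)a'

# The whole expression sits inside a lookahead, so every match is
# zero-width and the scan can resume one character later.
pattern = r'(?=(\(\d+?\)[a-z]\(\d+?\)|[a-z]\(\d+?\)|\(\d+?\)[a-z]))'

matches = re.findall(pattern, text)
print(matches)
# ['a(1)', '(1)a(2)', 'a(2)', '(2)a(1)', 'a(1)', '(1)a']
```

Note that this yields six entries, because the scan also stops at each inner `a`, so it finds more than the four overlapping matches the question asks for.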

See this for reference: How to find overlapping matches with a regexp?

edited Aug 05 '18 at 20:53

answered Aug 05 '18 at 19:38

Your regex pattern returns 6 matches for `a` but there are only 4 in `text`. The only part that should be overlapping are the brackets in between two `a`s. – RandomDude Aug 05 '18 at 21:54

score 1 · Answer 2 · answered Aug 05 '18 at 23:23

You can use:

re.findall(r'(?=\(\d+\)a|a\(\d+\))(?=((?:\((\d+)\))?a(?:\((\d+)\))?)).*?a', s)

Explanations:

The first lookahead checks that there is at least one number in parentheses next to the a, either before or after it.

The second lookahead is only there to capture what you want, but since both \(\d+\) parts are optional in it, the first lookahead is needed.

Then you only have to consume characters up to and including the a with .*?a, to avoid matching the same a twice.
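
To see the three parts working together, here is a small sketch that applies the pattern to the string from the question. It is assumed that `s` holds the same value as the question's `text`, and the name `result` is only illustrative.

```python
import re

s = 'a(1)a(2)a(1)a'  # assumed to be the question's `text`

pattern = (
    r'(?=\(\d+\)a|a\(\d+\))'                  # guard: a digit group touches the a
    r'(?=((?:\((\d+)\))?a(?:\((\d+)\))?))'    # capture without consuming
    r'.*?a'                                   # consume up to and including the a
)

result = re.findall(pattern, s)
print(result)
# [('a(1)', '', '1'), ('(1)a(2)', '1', '2'), ('(2)a(1)', '2', '1'), ('(1)a', '1', '')]
```

Only the `.*?a` part consumes characters, so each `a` is visited once while the bracketed digits on either side can still be captured by two neighbouring matches.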

score 1 · Accepted Answer · answered Aug 06 '18 at 08:31

To get overlapping matches, using capturing groups inside a lookahead is the right idea.

First define the starting point (zero-width). It should be either at the start of the string or before an opening parenthesis: (?:^|(?=\()), as we only need a(... at the start, or (... within or at the end.

At this point, trigger the lookahead. The pattern for capturing inside the lookahead (?=...) could be ((?:\((\d+)\))?a?(?:\((\d+)\))?), which makes each part optional and uses further groups inside to extract the digits. This can also be done by alternating the different options.

(?:^|(?=\())(?=((?:\((\d+)\))?a?(?:\((\d+)\))?))

Here is a demo at regex101
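
For use with `re` in Python, a minimal sketch could look like this. The names `result` and `starts` are only illustrative. `starts` is included to show that every match is zero-width and begins either at position 0 or at an opening parenthesis.

```python
import re

text = 'a(1)a(2)a(1)a'

pattern = re.compile(
    r'(?:^|(?=\())'                            # start of string, or just before "("
    r'(?=((?:\((\d+)\))?a?(?:\((\d+)\))?))'    # capture inside a lookahead
)

result = pattern.findall(text)
starts = [m.start() for m in pattern.finditer(text)]

print(result)
# [('a(1)', '', '1'), ('(1)a(2)', '1', '2'), ('(2)a(1)', '2', '1'), ('(1)a', '1', '')]
print(starts)
# [0, 1, 5, 9]
```

Since nothing is consumed, the engine simply moves on one character after each match, and the anchor `(?:^|(?=\())` keeps it from matching again at the inner `a` characters.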

kantal · Answer 4 · 2018-08-05T20:23:13.310

The parentheses are omitted to make the solution cleaner, but you can put them back:

text='a1a2a1a'

You must filter out empty strings from the result:

re.findall(r"(?=^(a(\d)))|(?=((\d)a)$)|(?=((\d)a(\d)))",text)
Out:
[('a1', '1', '', '', '', '', ''),
 ('', '', '', '', '1a2', '1', '2'),
 ('', '', '', '', '2a1', '2', '1'),
 ('', '', '1a', '1', '', '', '')]

Edit: According to @Michał Turczyn, a single lookahead will do it, too:

re.findall(r"(?=^(a(\d))|((\d)a)$|((\d)a(\d)))",text)
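
The filtering step mentioned above is not shown, so here is one possible sketch of it using the single-lookahead variant. The names `raw` and `cleaned` are only illustrative, and dropping the empty strings with a generator expression is just one way to do it.

```python
import re

text = 'a1a2a1a'
pattern = r"(?=^(a(\d))|((\d)a)$|((\d)a(\d)))"

# Each tuple has seven slots, one per capturing group; only the groups
# of the alternative that matched are non-empty.
raw = re.findall(pattern, text)

# Drop the empty strings left behind by the alternatives that did not match.
cleaned = [tuple(g for g in groups if g) for groups in raw]

print(cleaned)
# [('a1', '1'), ('1a2', '1', '2'), ('2a1', '2', '1'), ('1a', '1')]
```

After filtering, the two-element tuples no longer say whether the digit stood before or after the a, so check the first element if that distinction matters.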
