4

I want to parse the text so that each bracketed digit is attached both to the substring before it and to the substring after it. As far as I understand, a regex normally consumes the string as it matches, which means that by default matches can't overlap, right? How do I have to adapt pattern_3 to get the desired output?

import re

text = 'a(1)a(2)a(1)a'
pattern = r'(a(?:\((\d+)\))?)'
re.findall(pattern, text)
# [('a(1)', '1'), ('a(2)', '2'), ('a(1)', '1'), ('a', '')]


pattern_2 = r'((?:\((\d+)\))?a(?:\((\d+)\))?)'
re.findall(pattern_2, text)
# [('a(1)', '', '1'), ('a(2)', '', '2'), ('a(1)', '', '1'), ('a', '', '')]


pattern_3 = r'((?:\((\d+)\))?a(?=(?:\((\d+)\)))?)'
re.findall(pattern_3, text)
# [('a', '', '1'), ('(1)a', '1', '2'), ('(2)a', '2', '1'), ('(1)a', '1', '')]


# desired output:
# [('a(1)', '', '1'), ('(1)a(2)', '1', '2'), ('(2)a(1)', '2', '1'), ('(1)a', '1', '')]

Update

Looking for a solution using re only

RandomDude

4 Answers

1

You could try this pattern: (?=(\(\d+?\)[a-z]\(\d+?\)|[a-z]\(\d+?\)|\(\d+?\)[a-z])), which solves the problem using a positive lookahead.

Since look-arounds are assertions, they match but don't consume any of the string, so it's enough to put a capturing group inside one. That way you can match the same part of the string multiple times and read each match from the capturing group.

In my solution there is always exactly one capturing group.
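To make the mechanism concrete, here is a minimal sketch. The string '1234' and the variable names are assumptions chosen for illustration and are not from the question; the last call runs this answer's pattern on the question's text.

```python
import re

# Illustrative input (an assumption, not taken from the question).
digits = '1234'

# A plain pattern consumes what it matches, so matches cannot overlap.
consuming = re.findall(r'\d\d', digits)

# A capturing group inside a lookahead matches without consuming,
# so the engine retries at every position and overlapping pairs appear.
overlapping = re.findall(r'(?=(\d\d))', digits)

print(consuming)    # ['12', '34']
print(overlapping)  # ['12', '23', '34']

# This answer's pattern applied to the question's text.
text = 'a(1)a(2)a(1)a'
pattern = r'(?=(\(\d+?\)[a-z]\(\d+?\)|[a-z]\(\d+?\)|\(\d+?\)[a-z]))'
matches = re.findall(pattern, text)
print(matches)
# ['a(1)', '(1)a(2)', 'a(2)', '(2)a(1)', 'a(1)', '(1)a']
```

Because the lookahead is tried at every position, an inner a can be reported twice: once together with its leading bracket and once starting at the a itself.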

See this for reference: How to find overlapping matches with a regexp?

Demo

Michał Turczyn
  • Your regex pattern returns 6 matches for `a` but there are only 4 in `text`. The only part that should be overlapping are the brackets in between two `a`s. – RandomDude Aug 05 '18 at 21:54
1

You can use:

re.findall(r'(?=\(\d+\)a|a\(\d+\))(?=((?:\((\d+)\))?a(?:\((\d+)\))?)).*?a', text)

Explanations:

The first lookahead checks that there is at least one number in parentheses directly before or after the a.

The second lookahead is there only to capture what you want. Since both \(\d+\) parts in it are optional, it would also succeed on a bare a, which is why the first lookahead is needed.

Then you only have to consume characters up to and including the a with .*?a, to avoid matching the same a twice.
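The three parts described above can be checked against the question's text with a short sketch. Splitting the pattern into three string pieces and the names `pattern` and `result` are choices made here for readability, not part of the answer.

```python
import re

text = 'a(1)a(2)a(1)a'

pattern = (
    r'(?=\(\d+\)a|a\(\d+\))'                # guard: a number in parentheses touches the a
    r'(?=((?:\((\d+)\))?a(?:\((\d+)\))?))'  # capture: optional (n), the a, optional (n)
    r'.*?a'                                 # consume: move past this a
)

result = re.findall(pattern, text)
print(result)
# [('a(1)', '', '1'), ('(1)a(2)', '1', '2'), ('(2)a(1)', '2', '1'), ('(1)a', '1', '')]
```

Only the final `.*?a` consumes characters, so the closing bracket group after each a is still available as the leading group of the next match.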

demo

Casimir et Hippolyte
1

To get overlapping matches, using capturing groups inside a lookahead is the right idea.

First define the (zero-width) starting point. It should be either the start of the string or the position before an opening parenthesis: (?:^|(?=\()). That is enough because we only need a(... at the start, or (... in the middle or at the end.

At this point, trigger the lookahead. The capturing pattern inside the lookahead (?=...) could look like ((?:\((\d+)\))?a?(?:\((\d+)\))?), where each part is optional and an inner group extracts the digits. This can also be done by alternating between the different options.

(?:^|(?=\())(?=((?:\((\d+)\))?a?(?:\((\d+)\))?))

Here is a demo at regex101
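As a quick check in Python, the pattern can be run on the question's text with re.findall. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import re

text = 'a(1)a(2)a(1)a'

# Anchor at the start of the string or just before '(',
# then capture everything inside a lookahead so nothing is consumed.
pattern = r'(?:^|(?=\())(?=((?:\((\d+)\))?a?(?:\((\d+)\))?))'

result = re.findall(pattern, text)
print(result)
# [('a(1)', '', '1'), ('(1)a(2)', '1', '2'), ('(2)a(1)', '2', '1'), ('(1)a', '1', '')]
```

Every match here is zero-width, so the engine simply advances one character at a time and only succeeds at the allowed starting points.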

bobble bubble
0

The parentheses are omitted to keep the solution cleaner, but you can put them back:

text='a1a2a1a'

You must filter out empty strings from the result:

re.findall(r"(?=^(a(\d)))|(?=((\d)a)$)|(?=((\d)a(\d)))",text)
     Out:
    [('a1', '1', '', '', '', '', ''),
     ('', '', '', '', '1a2', '1', '2'),
     ('', '', '', '', '2a1', '2', '1'),
     ('', '', '1a', '1', '', '', '')]

Edit: According to @Michał Turczyn, a single lookahead will do it, too:

re.findall(r"(?=^(a(\d))|((\d)a)$|((\d)a(\d)))",text)
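The filtering step mentioned above is not shown, so here is one possible sketch of it using the single-lookahead pattern. The names `raw` and `cleaned` are hypothetical, and dropping empty strings with a generator expression is just one way to do it.

```python
import re

text = 'a1a2a1a'
pattern = r"(?=^(a(\d))|((\d)a)$|((\d)a(\d)))"

# Each tuple has one slot per capturing group; groups belonging to
# alternatives that did not match come back as empty strings.
raw = re.findall(pattern, text)

# Keep only the non-empty groups of each match.
cleaned = [tuple(g for g in groups if g) for groups in raw]
print(cleaned)
# [('a1', '1'), ('1a2', '1', '2'), ('2a1', '2', '1'), ('1a', '1')]
```

After filtering, the position of a digit inside a tuple no longer says whether it came before or after the a for the first and last match, so keep the raw tuples if that distinction matters.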
kantal