Python Regex: Capture overlapping parts

Question

Given a string s = "<foo>abcaaa<bar>a<foo>cbacba<foo>c", I'm trying to write a regular expression that extracts each angle-bracketed tag together with the text on either side of it. Like this:

<foo>abcaaa
abcaaa<bar>a
a<foo>cbacba
cbacba<foo>c

So expected output should look like this:

["<foo>abcaaa", "abcaaa<bar>a", "a<foo>cbacba", "cbacba<foo>c"]

I found the question "How to find overlapping matches with a regexp?", which brought me a little closer to the desired result, but my regex still doesn't work:

regex = r"(?=([a-c]*)\<(\w+)\>([a-c]*))"

Any ideas how to solve this problem?

I am sorry, but your expected output is a string split by newlines. — Olvin Roght, Apr 01 '20 at 22:47
Interesting question. When you edit you don't need to say you've done so (i.e., EDIT: not needed). It's better to just revise your question as though you were editing a draft of a text (just don't change the question). — Cary Swoveland, Apr 01 '20 at 23:30

score 2 · Answer 1 · answered Apr 01 '20 at 22:48

You need to set the left- and right-hand boundaries to < or > chars or start/end of string.

Use

import re
text = "<foo>abcaaa<bar>a<foo>cbacba<foo>c"
print( re.findall(r'(?=(?<![^<>])([a-c]*<\w+>[a-c]*)(?![^<>]))', text) )
# => ['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']

See the Python demo online and the regex demo.

Pattern details

(?= - start of a positive lookahead, which enables overlapping matches because it consumes no text
- (?<![^<>]) - immediately to the left there must be the start of string, < or >
- ([a-c]*<\w+>[a-c]*) - Group 1 (the value extracted): 0+ a, b or c chars, then <, 1+ word chars, > and again 0+ a, b or c chars
- (?![^<>]) - immediately to the right there must be the end of string, < or >
) - end of the lookahead.
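
To see why the boundary checks matter, here is a small sketch comparing the lookahead-only pattern from the question with the bounded pattern above. The variable names unbounded and bounded are only illustrative.

```python
import re

text = "<foo>abcaaa<bar>a<foo>cbacba<foo>c"

# Pattern from the question: a bare lookahead succeeds at every position
# from which "[a-c]*<tag>[a-c]*" can start, including the middle of a run.
unbounded = re.findall(r"(?=([a-c]*)\<(\w+)\>([a-c]*))", text)

# Bounded pattern: the lookbehind rejects any start position that is
# preceded by a char other than < or >, and the inner lookahead does the
# same for the char that follows the captured text.
bounded = re.findall(r"(?=(?<![^<>])([a-c]*<\w+>[a-c]*)(?![^<>]))", text)

print(len(unbounded), len(bounded))
print(unbounded[1], unbounded[2])  # both begin within the run "abcaaa"
```

The unbounded version reports a separate result for each suffix of a letter run (abcaaa, bcaaa, caaa, ...), which is why it returns far more than the four expected items.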

anubhava · Answer 2 · 2020-04-01T23:01:17.220

You may use this regex in Python:

>>> import re
>>> s = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
>>> reg = r'([^<>]*<[^>]*>)(?=([^<>]*))'
>>> print([''.join(i) for i in re.findall(reg, s)])
['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']

RegEx Demo

RegEx Details:

([^<>]*<[^>]*>): Capture group #1 matches 0 or more characters that are neither < nor >, followed by a <...> tag.
(?=([^<>]*)): Lookahead that captures, in group #2, the 0 or more non-<> characters ahead of the current position without consuming them.
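
As a rough illustration of why the ''.join(...) step is needed, the sketch below prints what re.findall() returns for this pattern before joining. The names pairs, consumed, peeked and pieces are made up for this example.

```python
import re

s = "<foo>abcaaa<bar>a<foo>cbacba<foo>c"
reg = r"([^<>]*<[^>]*>)(?=([^<>]*))"

# With two capturing groups, re.findall() returns one (group1, group2)
# tuple per match: the consumed "prefix + tag" and the looked-ahead suffix.
pairs = re.findall(reg, s)
for consumed, peeked in pairs:
    print(repr(consumed), "+", repr(peeked))

# Joining each pair rebuilds the overlapping pieces.
pieces = [consumed + peeked for consumed, peeked in pairs]
print(pieces)
```

Because the suffix is only looked at and not consumed, the next match can start on it and reuse it as its own prefix.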

score 2 · Accepted Answer · answered Apr 01 '20 at 23:01

You can match overlapping content with standard regex syntax by putting capturing groups inside lookaround assertions: a lookaround can match part of the string without consuming it, so the same text stays available to later matches. In this specific example, we match either the beginning of the string or a > as the anchor, followed by a lookahead assertion that captures the actual target:

(?:\A|>)(?=([a-c]*<\w+>[a-c]*))

See regex demo.

In Python, we then rely on the fact that re.findall() returns only the captured groups when the pattern contains capturing groups:

import re

text = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
expr = r'(?:\A|>)(?=([a-c]*<\w+>[a-c]*))'
captures = re.findall(expr, text)
print(captures)

Output:

['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
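
If you want to check that the captures really overlap, one option is re.finditer(), which exposes the span of both the whole match and the captured group. This is a minimal sketch; the name spans is only illustrative.

```python
import re

text = "<foo>abcaaa<bar>a<foo>cbacba<foo>c"
expr = r"(?:\A|>)(?=([a-c]*<\w+>[a-c]*))"

# m.span() is the part the match actually consumed (empty at the start
# of the string, otherwise a single ">"); m.span(1) is the extent of the
# text captured inside the lookahead.
spans = [(m.span(), m.span(1), m.group(1)) for m in re.finditer(expr, text)]
for consumed, captured, value in spans:
    print(consumed, captured, value)
```

Each captured span starts before the previous one ends, while the consumed spans never overlap, which is what lets the regex engine keep finding matches.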
