Given a string s = "<foo>abcaaa<bar>a<foo>cbacba<foo>c", I'm trying to write a regular expression that extracts each angle-bracketed tag together with the text on either side of it. Like this:

<foo>abcaaa
abcaaa<bar>a
a<foo>cbacba
cbacba<foo>c

So expected output should look like this:

["<foo>abcaaa", "abcaaa<bar>a", "a<foo>cbacba", "cbacba<foo>c"]

I found the question "How to find overlapping matches with a regexp?", which brought me a little closer to the desired result, but my regex still doesn't work:

regex = r"(?=([a-c]*)\<(\w+)\>([a-c]*))"

Any ideas how to solve this problem?

MrZH6

3 Answers

You need to set the left- and right-hand boundaries to < or > chars or start/end of string.

Use

import re
text = "<foo>abcaaa<bar>a<foo>cbacba<foo>c"
print( re.findall(r'(?=(?<![^<>])([a-c]*<\w+>[a-c]*)(?![^<>]))', text) )
# => ['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']

See the Python demo online and the regex demo.

Pattern details

  • (?= - start of a positive lookahead to enable overlapping matches
    • (?<![^<>]) - the current position must be at the start of the string or immediately preceded by < or >
    • ([a-c]*<\w+>[a-c]*) - Group 1 (the value extracted): 0+ a, b or c chars, then <, 1+ word chars, > and again 0+ a, b or c chars
    • (?![^<>]) - end of string, < or > must follow immediately
  • ) - end of the lookahead.
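
To see what the boundary lookarounds contribute, here is a minimal sketch that compares the pattern above with a variant that drops them. The names bounded and unbounded are illustrative only and are not part of the answer.

```python
import re

text = "<foo>abcaaa<bar>a<foo>cbacba<foo>c"

# Pattern from the answer: the captured group may only start right after
# the start of the string, "<" or ">".
bounded = re.findall(r'(?=(?<![^<>])([a-c]*<\w+>[a-c]*)(?![^<>]))', text)

# Same pattern with both boundary lookarounds removed (comparison only).
# The lookahead now also succeeds in the middle of a run of a/b/c chars.
unbounded = re.findall(r'(?=([a-c]*<\w+>[a-c]*))', text)

print(bounded)        # the four expected items
print(unbounded[:3])  # includes a partial run such as 'bcaaa<bar>a'
```

Without the boundaries, the lookahead is tried at every position, so each suffix of a text run before a tag is reported as well.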
Wiktor Stribiżew

You may use this regex code in python:

>>> import re
>>> s = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
>>> reg = r'([^<>]*<[^>]*>)(?=([^<>]*))'
>>> print([''.join(i) for i in re.findall(reg, s)])
['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']

RegEx Demo

RegEx Details:

  • ([^<>]*<[^>]*>): Capture group #1 matches 0 or more characters that are neither < nor >, followed by a <...> string. This part is consumed by the match.
  • (?=([^<>]*)): Lookahead to assert that we have 0 or more non-<> characters ahead of current position. We have capture group #2 inside this lookahead.
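
As a rough illustration of how the two groups fit together, the sketch below prints the raw tuples that re.findall returns before they are joined. The variable names pairs, consumed and peeked are assumptions made for this example.

```python
import re

s = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
reg = re.compile(r'([^<>]*<[^>]*>)(?=([^<>]*))')

# With two capture groups, findall returns one (group1, group2) tuple per match.
# group1 is consumed by the match; group2 is only peeked at by the lookahead,
# so the same text can become the start of group1 in the next match.
pairs = reg.findall(s)
print(pairs)

# Gluing both parts back together gives the overlapping pieces.
joined = [consumed + peeked for consumed, peeked in pairs]
print(joined)
```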
anubhava

You can match overlapping content with standard regex syntax by placing capturing groups inside lookaround assertions. A lookaround can capture part of the string without consuming it, so the captured text stays available for later matches. In this specific example, we match either the beginning of the string or a > as the anchor, and the lookahead assertion that follows captures our actual targets:

(?:\A|>)(?=([a-c]*<\w+>[a-c]*))

See regex demo.

In Python, we then rely on the fact that re.findall() returns only the captured groups when the expression contains capturing groups:

import re

text = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
expr = r'(?:\A|>)(?=([a-c]*<\w+>[a-c]*))'
# findall returns only the content of group 1, captured inside the lookahead
captures = re.findall(expr, text)
print(captures)

Output:

['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
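
To check that the captures really overlap, a small sketch with re.finditer can print where each captured group sits in the string. The name spans is an assumption for this example.

```python
import re

text = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
expr = r'(?:\A|>)(?=([a-c]*<\w+>[a-c]*))'

# m.span(1) is the (start, end) range of the group captured in the lookahead.
spans = [m.span(1) for m in re.finditer(expr, text)]
print(spans)

# Each capture starts before the previous one ends, i.e. they overlap.
for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
    print(next_start < prev_end)
```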
oriberu