Is there a way in regex to match a string only if it occurs twice in a given structure, for example as the opening and closing tag in XML? This code obviously does not work, because it matches from the first opening tag all the way to the last closing tag:
re.findall(r'<(.+)>([\s\S]*)</(.+)>', s)
So is there a way to tell the regex that the third group must match the same text as the first?
Full code:
import re
s = '''<a1>
<a2>
1
</a2>
<b2>
52
</b2>
<c2>
<a3>
Abc
</a3>
</c2>
</a1>
<b1>
21
</b1>'''
matches = re.findall(r'<(.+)>([\s\S]*)</(.+)>', s)
for match in matches:
    print(match)
The result should be every tag together with its contents:
[('a1', '\n <a2>\n 1\n </a2>\n <b2>\n 52\n </b2>\n <c2>\n <a3>\n Abc\n </a3>\n </c2>\n'),
('a2', '\n 1\n '),
...]
Note: I am not looking for a complete XML parsing package. The question is specifically about solving the given problem with regex.
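For reference, one possible approach is a backreference: `\1` matches whatever group 1 captured, so the closing tag has to repeat the opening tag's name. The sketch below is only a minimal illustration and rests on a few assumptions that are not part of the question: tag names consist of word characters only, tags carry no attributes, and a tag never contains another tag with the same name. Wrapping the pattern in a lookahead `(?=...)` is an additional choice made here so that matches can overlap and nested tags are reported too.

```python
import re

s = '''<a1>
<a2>
1
</a2>
<b2>
52
</b2>
<c2>
<a3>
Abc
</a3>
</c2>
</a1>
<b1>
21
</b1>'''

# (?=...)    zero-width lookahead, so a match consumes nothing and
#            nested tags inside an outer match are still found.
# (\w+)      group 1: the tag name (assumed to be word characters only).
# [\s\S]*?   group 2: the contents, matched lazily up to the first
#            closing tag that fits.
# \1         backreference: the closing tag must repeat group 1.
pattern = re.compile(r'(?=<(\w+)>([\s\S]*?)</\1>)')

matches = pattern.findall(s)
for match in matches:
    print(match)
```

Each element of `matches` is a `(name, contents)` tuple, because the pattern has only two capturing groups. Note that with the lazy quantifier a tag nested inside another tag of the same name (for example `<a><a></a></a>`) would be paired with the wrong closing tag; Python's `re` module cannot balance arbitrarily deep nesting of that kind.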