I am still struggling with a regular expression. Here is a minimal example:
import re
text = '''
<SW-VARIABLE>
<SHORT-NAME>abc</SHORT-NAME>
<CATEGORY>VALUE</CATEGORY>
<SW-ARRAYSIZE>
<VF>4</VF>
</SW-ARRAYSIZE>
<SW-DATA-DEF-PROPS>
cde
</SW-DATA-DEF-PROPS>
</SW-VARIABLE>
<SW-VARIABLE>
<SHORT-NAME>def</SHORT-NAME>
<CATEGORY>VALUE</CATEGORY>
<SW-ARRAYSIZE>
<VF>8</VF>
</SW-ARRAYSIZE>
<SW-DATA-DEF-PROPS>
<HELLO>dsfadsf </HELLO>
<NO>itis</NO>
</SW-DATA-DEF-PROPS>
</SW-VARIABLE>
'''
pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>'
print(re.findall(pattern, text, re.S))
This returns:
[('abc', '8')]
I would expect it to return:
[('abc', '4'), ('def', '8')]
Why is it so greedy, matching everything up to the last closing tag instead of stopping at the first one?
This is the regex101 link: https://regex101.com/r/ANO7RA/1
Maybe a negative lookahead would solve this, but I have not been able to fully grasp the concept yet. :-(
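For reference, here is a minimal sketch of the negative-lookahead idea. With re.S the first `.*` in the pattern can cross newlines, and because it is greedy it runs on to the last `<SW-ARRAYSIZE>` in the whole text. The sketch replaces it with `(?:(?!</SW-VARIABLE>).)*?`, which consumes one character at a time and refuses to step over a closing `</SW-VARIABLE>` tag. The names `sample`, `lookahead_pattern` and `pairs` are illustrative and not part of the original code, and the input is a trimmed copy of the text above. This is only one possible approach; for real XML a proper parser is probably more robust than a regex.

```python
import re

# Trimmed copy of the sample input (assumption: same tag layout as above).
sample = '''
<SW-VARIABLE>
<SHORT-NAME>abc</SHORT-NAME>
<CATEGORY>VALUE</CATEGORY>
<SW-ARRAYSIZE>
<VF>4</VF>
</SW-ARRAYSIZE>
</SW-VARIABLE>
<SW-VARIABLE>
<SHORT-NAME>def</SHORT-NAME>
<CATEGORY>VALUE</CATEGORY>
<SW-ARRAYSIZE>
<VF>8</VF>
</SW-ARRAYSIZE>
</SW-VARIABLE>
'''

# Hypothetical replacement pattern, not the original one.
lookahead_pattern = re.compile(
    r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>'
    # Match any character, lazily, but never start a closing </SW-VARIABLE> tag,
    # so the match cannot spill into the next variable.
    r'(?:(?!</SW-VARIABLE>).)*?'
    r'<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>',
    re.S,
)

pairs = lookahead_pattern.findall(sample)
print(pairs)  # [('abc', '4'), ('def', '8')]
```

Simply changing the first `.*` to `.*?` also gives the expected result for this input, but without the lookahead a variable that has no `<SW-ARRAYSIZE>` would let the match run into the next variable's block.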