I am trying to capture multiple "<attribute> = <value>" pairs with a Python regular expression from a string like this:
some(code) ' <tag attrib1="some_value" attrib2="value2" en=""/>
The regular expression '\s*<tag(?:\s*(\w+)\s*=\"(.*?)\")*
is intended to match those pairs multiple times, i.e. return something like
"attrib1", "some_value", "attrib2", "value2", "en", ""
but it only captures the last occurrence:
>>> import re
>>> re.search(r"'\s*<tag(?:\s*(\w+)\s*=\"(.*?)\")*", ' some(code) \' <tag attrib1="some_value" attrib2="value2" en=""/>').groups()
('en', '')
Focusing on <attrib>="<value>" works:
>>> re.findall(r"(?:\s*(\w+)\s*=\"(.*?)\")", ' some(code) \' <tag attrib1="some_value" attrib2="value2" en=""/>')
[('attrib1', 'some_value'), ('attrib2', 'value2'), ('en', '')]
so a pragmatic workaround might be to check that "<tag" is in the string
before running this regular expression. Still, I would like to understand the behaviour:
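For reference, the workaround I have in mind would look roughly like the sketch below. The helper name tag_attributes and the compiled ATTR_RE are just names made up for this illustration; the pattern is the attribute-only regex from above.

```python
import re

# Attribute-only pattern from above, written as a raw string.
ATTR_RE = re.compile(r'\s*(\w+)\s*="(.*?)"')

def tag_attributes(text):
    """Return (name, value) pairs, or [] if the text contains no "<tag"."""
    # Plain substring test first, as described above.
    if "<tag" not in text:
        return []
    # Note: this scans the whole string, not only the part after "<tag".
    return ATTR_RE.findall(text)

sample = ' some(code) \' <tag attrib1="some_value" attrib2="value2" en=""/>'
print(tag_attributes(sample))
# [('attrib1', 'some_value'), ('attrib2', 'value2'), ('en', '')]
```

This works for my sample, but it does not tie the pairs to the tag itself, which is why I would prefer a single regex.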
Why does the original regex only capture the last occurrence, and what needs to be changed to make it work as intended?