Regular expression only captures the last occurrence of repeated group

Question

I am trying to capture multiple "<attribute> = <value>" pairs with a Python regular expression from a string like this:

  some(code) ' <tag attrib1="some_value" attrib2="value2"                   en=""/>

The regular expression '\s*<tag(?:\s*(\w+)\s*=\"(.*?)\")* is intended to match those pairs multiple times, i.e. return something like

"attrib1", "some_value", "attrib2", "value2", "en", ""

but it only captures the last occurrence:

>>> import re
>>> re.search("'\s*<tag(?:\s*(\w+)\s*=\"(.*?)\")*", '  some(code) \' <tag attrib1="some_value" attrib2="value2"                   en=""/>').groups()
('en', '')

Focusing on <attrib>="<value>" works:

>>> re.findall("(?:\s*(\w+)\s*=\"(.*?)\")", '  some(code) \' <tag attrib1="some_value" attrib2="value2"                   en=""/>')
[('attrib1', 'some_value'), ('attrib2', 'value2'), ('en', '')]

So a pragmatic solution might be to test for "<tag" in the string before running this regular expression, but...

Why does the original regex only capture the last occurrence, and what needs to be changed to make it work as intended?

The weekly "how to parse html/xml with regex" question... Use an XML parser. Don't try to use a *regular* expression on a language that isn't regular. — DeepSpace, May 09 '17 at 09:03
That's how regex works. It captures only the last occurrence. You can't capture an arbitrary number of occurrences with regex. Write a loop to apply the regex multiple times, or use an xml parser. — Aran-Fey, May 09 '17 at 09:10
@Rawing Could you elaborate on why it only captures the last occurrence of a repeating group in an "answer" or provide some references? If the engine "sees" the repeating group, why does it not capture it? Is there maybe an option to not overwrite the last group-match? — handle, May 09 '17 at 09:32
Related: http://stackoverflow.com/questions/41582889/repeated-capturing-group-pcre, http://stackoverflow.com/questions/37003623/how-to-capture-multiple-repeated-groups, http://www.regular-expressions.info/captureall.html - I'll do some reading... — handle, May 09 '17 at 09:40
@StutiRastogi No, but thanks. BTW: the string is only one of many lines that may or may not contain the data I am looking to extract, so it needs to match `' — handle, May 09 '17 at 09:49
Is there a reason why you can't use a third party XML parser? — ymbirtt, May 09 '17 at 09:57
@ymbirtt Yes: it's not XML, it's just marked-up name=value pairs in source code comments. — handle, May 09 '17 at 10:02
If it's not a known language and isn't necessarily regular, then it's looking similar to an "I need to write my own parser" question. Does my answer at http://stackoverflow.com/questions/42435114/in-python-how-to-parse-a-string-representing-a-set-of-keyword-arguments-such-th/42437175#42437175 help? — ymbirtt, May 09 '17 at 10:23
@ymbirtt Thanks, (py)parsing is of interest indeed, though not so much for the problem at hand. — handle, May 09 '17 at 11:38

score 6 · Accepted Answer · answered May 09 '17 at 09:32

This is just how regex works: you defined one capturing group, so there is only one capturing group, no matter how often the surrounding repetition runs. When it first captures something and then captures another thing, the first captured item is replaced. That's why you only get the last captured one.
I am not aware of a way to change that within a single match.
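A possible workaround, in the spirit of the "apply the regex multiple times" suggestion from the comments, is to split the job into two steps: first match the whole tag and capture the entire run of pairs as one group, then run a second pattern over that captured text. The sketch below is only an illustration. The helper name extract_attributes is hypothetical, and the patterns are adapted from the question rather than copied: the value is matched with [^"]* instead of .*?, and whitespace after the = is tolerated.

```python
import re

# Step 1: find the tag and capture the whole attribute run as ONE group.
# The repetition sits inside the capturing group, so nothing is overwritten.
TAG_RE = re.compile(r"'\s*<tag((?:\s*\w+\s*=\s*\"[^\"]*\")*)")

# Step 2: a pattern for a single name="value" pair, applied repeatedly.
PAIR_RE = re.compile(r"(\w+)\s*=\s*\"([^\"]*)\"")


def extract_attributes(line):
    """Return the (name, value) pairs of the first ' <tag ...> in line.

    Returns an empty list when the line contains no such tag.
    (Hypothetical helper, not part of the original question.)
    """
    tag = TAG_RE.search(line)
    if tag is None:
        return []
    # findall yields one tuple per pair because each pair is its own match.
    return PAIR_RE.findall(tag.group(1))


line = '  some(code) \' <tag attrib1="some_value" attrib2="value2"                   en=""/>'
print(extract_attributes(line))
print(extract_attributes("a line without the tag"))
```

Because the first step still requires the leading ' and <tag, lines that do not contain the tag are skipped without a separate "<tag" in string test.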

score 1 · Answer 2 · answered Jul 28 '21 at 12:14

Unfortunately this is not possible with Python's built-in re module. But the third-party regex module provides captures and capturesdict methods on its match objects for that:

>>> import regex
>>> m = regex.match(r"(?:(?P<word>\w+) (?P<digits>\d+)\n)+", "one 1\ntwo 2\nthree 3\n")
>>> m.groupdict()
{'word': 'three', 'digits': '3'}
>>> m.captures("word")
['one', 'two', 'three']
>>> m.captures("digits")
['1', '2', '3']
>>> m.capturesdict()
{'word': ['one', 'two', 'three'], 'digits': ['1', '2', '3']}

score -1 · Answer 3 · answered May 09 '17 at 09:07

According to the documentation, `search` returns only one match. The `findall` function returns all matches as a list. That is what you need to use, like in your second example.
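As a small sketch of why `findall` helps with the pair pattern but not with the full pattern from the question: each match keeps only the final repetition of a group, so `findall` has to see every pair as a separate match before it can list them all. The variable names below are made up for illustration; the two patterns and the input string are taken from the question.

```python
import re

line = '  some(code) \' <tag attrib1="some_value" attrib2="value2"                   en=""/>'

# Full pattern from the question: the two groups sit inside a repeated (?:...)*,
# so the single match found in this line keeps only the last repetition.
full_pattern = r"'\s*<tag(?:\s*(\w+)\s*=\"(.*?)\")*"
whole_tag_matches = re.findall(full_pattern, line)

# Pair pattern alone: every name="value" pair is a separate match,
# so findall returns one tuple per pair.
pair_pattern = r"(\w+)\s*=\"(.*?)\""
pair_matches = re.findall(pair_pattern, line)

print(whole_tag_matches)
print(pair_matches)
```

The first call yields a single tuple holding the last pair, while the second yields all three pairs.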

Stuti Rastogi


Exactly, but I only need one occurrence: the pattern should match the _whole string_, albeit with multiple repetitions of one group. `findall` also only returns the last match. – handle May 09 '17 at 09:10
The pattern _does_ match the whole string _with_ repetitions of the group, only this does not produce multiple match groups, unfortunately. – handle May 09 '17 at 09:28
What are you aiming for? Do you want only one occurrence or all of them? And what is wrong in the second example? I don't know why you want to have – Stuti Rastogi May 09 '17 at 09:34
Thanks, but it's two different things. Have a look at the comments and the other answer, they address the original problem. – handle May 09 '17 at 09:43
I looked at them and understood, your question was not clear to me initially. Good luck with the reading. – Stuti Rastogi May 09 '17 at 09:44
