
I am trying to capture multiple "<attribute> = <value>" pairs with a Python regular expression from a string like this:

  some(code) ' <tag attrib1="some_value" attrib2="value2"                   en=""/>

The regular expression '\s*<tag(?:\s*(\w+)\s*=\"(.*?)\")* is intended to match those pairs multiple times, i.e. return something like

"attrib1", "some_value", "attrib2", "value2", "en", ""

but it only captures the last occurrence:

>>> import re
>>> re.search("'\s*<tag(?:\s*(\w+)\s*=\"(.*?)\")*", '  some(code) \' <tag attrib1="some_value" attrib2="value2"                   en=""/>').groups()
('en', '')

Focusing on <attrib>="<value>" works:

>>> re.findall("(?:\s*(\w+)\s*=\"(.*?)\")", '  some(code) \' <tag attrib1="some_value" attrib2="value2"                   en=""/>')
[('attrib1', 'some_value'), ('attrib2', 'value2'), ('en', '')]

so a pragmatic solution might be to test whether "<tag" is in the string before running this regular expression, but...

Why does the original regex only capture the last occurrence, and what needs to be changed to make it work as intended?

handle
  • The weekly "how to parse html/xml with regex" question... Use an XML parser. Don't try to use a *regular* expression on a language that isn't regular. – DeepSpace May 09 '17 at 09:03
  • You are right, the question is really about regex, not XML. – handle May 09 '17 at 09:05
  • That's how regex works. It captures only the last occurrence. You can't capture an arbitrary number of occurrences with regex. Write a loop to apply the regex multiple times, or use an XML parser. – Aran-Fey May 09 '17 at 09:10
  • @Rawing Could you elaborate in an "answer" on why it only captures the last occurrence of a repeating group, or provide some references? If the engine "sees" the repeating group, why does it not capture it? Is there maybe an option to not overwrite the last group match? – handle May 09 '17 at 09:32
  • Related: http://stackoverflow.com/questions/41582889/repeated-capturing-group-pcre, http://stackoverflow.com/questions/37003623/how-to-capture-multiple-repeated-groups, http://www.regular-expressions.info/captureall.html - I'll do some reading... – handle May 09 '17 at 09:40
  • Have you tried, group(0)? Is that what you need? – Stuti Rastogi May 09 '17 at 09:40
  • @StutiRastogi No, but thanks. BTW: the string is only one of many lines that may or may not contain the data I am looking to extract, so it needs to match `' – handle May 09 '17 at 09:49
  • Is there a reason why you can't use a third party XML parser? – ymbirtt May 09 '17 at 09:57
  • @ymbirtt Yes: it's not XML, it's just marked-up name=value pairs in source code comments. – handle May 09 '17 at 10:02
  • If it's not a known language and isn't necessarily regular, then it's looking similar to an "I need to write my own parser" question. Does my answer at http://stackoverflow.com/questions/42435114/in-python-how-to-parse-a-string-representing-a-set-of-keyword-arguments-such-th/42437175#42437175 help? – ymbirtt May 09 '17 at 10:23
  • @ymbirtt Thanks, (py)parsing is of interest indeed, though not so much for the problem at hand. – handle May 09 '17 at 11:38

3 Answers


This is just how regex works: you defined one capturing group, so there is only one capturing group. Each time the group matches again, the newly captured text replaces the previously captured text. That's why you only get the last capture.
There is no solution for that that I am aware of...
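
A two-step workaround with the standard re module can be sketched as follows. This is only an illustration, not part of the original answer: the names LINE, TAG, PAIR and tag_attributes are hypothetical. The outer pattern captures the whole run of attributes as one group, and a second pattern is then applied to that run with findall.

```python
import re

# Sample line from the question.
LINE = '  some(code) \' <tag attrib1="some_value" attrib2="value2"                   en=""/>'

# Step 1: anchor on the apostrophe and "<tag", and capture the entire
# run of attribute pairs as ONE group instead of repeating a capturing group.
TAG = re.compile("'" + r'\s*<tag((?:\s*\w+\s*=".*?")*)')

# Step 2: pattern for a single name="value" pair inside that run.
PAIR = re.compile(r'(\w+)\s*="(.*?)"')


def tag_attributes(line):
    """Return a list of (name, value) tuples, or [] if the line has no tag."""
    m = TAG.search(line)
    if m is None:
        return []
    return PAIR.findall(m.group(1))


print(tag_attributes(LINE))
```

Lines that do not contain the apostrophe followed by "<tag" yield an empty list, so no separate "<tag" in string test is needed.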

Gawil

Unfortunately this is not possible with Python's re module. But the third-party regex module provides captures and capturesdict methods for that:

>>> import regex
>>> m = regex.match(r"(?:(?P<word>\w+) (?P<digits>\d+)\n)+", "one 1\ntwo 2\nthree 3\n")
>>> m.groupdict()
{'word': 'three', 'digits': '3'}
>>> m.captures("word")
['one', 'two', 'three']
>>> m.captures("digits")
['1', '2', '3']
>>> m.capturesdict()
{'word': ['one', 'two', 'three'], 'digits': ['1', '2', '3']}
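
Applied to the string from the question, the same idea might look like the sketch below. It assumes the third-party regex package is installed (it is not part of the standard library); LINE, PATTERN and attribute_pairs are hypothetical names used only for illustration.

```python
try:
    import regex  # third-party package, assumed installed (pip install regex)
except ImportError:
    regex = None

# Sample line and pattern from the question.
LINE = '  some(code) \' <tag attrib1="some_value" attrib2="value2"                   en=""/>'
PATTERN = "'" + r'\s*<tag(?:\s*(\w+)\s*="(.*?)")*'


def attribute_pairs(line):
    """Pair up every capture of group 1 (name) with group 2 (value)."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is not installed")
    m = regex.search(PATTERN, line)
    if m is None:
        return []
    # captures(n) returns all captures of group n, not just the last one.
    return list(zip(m.captures(1), m.captures(2)))


if regex is not None:
    print(attribute_pairs(LINE))
```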
Teivaz

According to the documentation, search returns only one occurrence. The findall function returns all occurrences in a list. That is what you need to use, like in your second example.

Stuti Rastogi
  • Exactly, but I only need one occurrence: the pattern should match the _whole string_, albeit with multiple repetitions of one group. `findall` also only returns the last match. – handle May 09 '17 at 09:10
  • The pattern _does_ match the whole string _with_ repetitions of the group, only this does not produce multiple match groups, unfortunately. – handle May 09 '17 at 09:28
  • What are you aiming for? Do you want only one occurrence or all of them? And what is wrong in the second example? I don't know why you want to have – Stuti Rastogi May 09 '17 at 09:34
  • Thanks, but it's two different things. Have a look at the comments and the other answer, they address the original problem. – handle May 09 '17 at 09:43
  • I looked at them and understood, your question was not clear to me initially. Good luck with the reading. – Stuti Rastogi May 09 '17 at 09:44