XML parsing with Python and regex does not return all results

Question

I am still struggling with regexp:

import re

text = '''
          <SW-VARIABLE>
            <SHORT-NAME>abc</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>4</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
              cde
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>

          <SW-VARIABLE>
            <SHORT-NAME>def</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>8</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
                <HELLO>dsfadsf </HELLO>
                <NO>itis</NO>
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>
'''

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>'

print(re.findall(pattern, text, re.S))

This returns:

[('abc', '8')]

I would expect it to return:

[('abc', '4'), ('def', '8')]

Why is it so greedy, matching everything up to the last closing tag?

This is the regex101 link: https://regex101.com/r/ANO7RA/1

Maybe negative lookahead will solve this. I was not able to fully grasp the concept, though... :-(

I _strongly_ urge you to use a proper XML parser. See [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/a/1732454/354577). While it may be possible to handle specific narrow use cases with regular expressions, in general it is **_literally not possible_** to parse XML with regex. It's almost always better to use a proper XML / HTML parser like [`lxml`](https://lxml.de/) or an XML query language like [XPath](https://en.wikipedia.org/wiki/XPath). — ChrisGPT was on strike, Feb 17 '20 at 17:32
See also [How do I parse XML in Python?](https://stackoverflow.com/q/1912434/354577) — ChrisGPT was on strike, Feb 17 '20 at 17:34
I second what @Chris said. I don't know a single person who favors XML over JSON, but a few of them have tried to use regex. It only creates more problems. Recently I found [xmltodict](https://github.com/martinblech/xmltodict) and it's super easy to use (I don't like `lxml` either). — Tom Wojcik, Feb 17 '20 at 18:00
I do not favor XML over JSON either, but I have to make do with the format my source information comes in. — mrCarnivore, Feb 17 '20 at 18:01
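
For reference, here is a minimal sketch of the parser route the comments recommend. It uses the standard-library xml.etree.ElementTree rather than lxml or xmltodict, and that choice is an assumption, not something the commenters prescribed. The wrapper tag ROOT and the helper name extract_pairs are made up for illustration. The wrapper is needed only because the snippet in the question has several top-level elements.

```python
import xml.etree.ElementTree as ET

# Shortened version of the data in the question.
fragment = '''
<SW-VARIABLE>
  <SHORT-NAME>abc</SHORT-NAME>
  <CATEGORY>VALUE</CATEGORY>
  <SW-ARRAYSIZE><VF>4</VF></SW-ARRAYSIZE>
</SW-VARIABLE>
<SW-VARIABLE>
  <SHORT-NAME>def</SHORT-NAME>
  <CATEGORY>VALUE</CATEGORY>
  <SW-ARRAYSIZE><VF>8</VF></SW-ARRAYSIZE>
</SW-VARIABLE>
'''

def extract_pairs(xml_fragment):
    # Hypothetical helper: wrap the fragment in a made-up <ROOT> element
    # so that it becomes a well-formed document, then walk the tree.
    root = ET.fromstring(f"<ROOT>{xml_fragment}</ROOT>")
    pairs = []
    for var in root.iter("SW-VARIABLE"):
        name = var.findtext("SHORT-NAME")
        size = var.findtext("SW-ARRAYSIZE/VF")
        pairs.append((name, size))
    return pairs

print(extract_pairs(fragment))  # [('abc', '4'), ('def', '8')]
```

If the real file already has a single root element, the wrapping step should not be necessary.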

score 2 · Answer 1 · answered Feb 17 '20 at 18:03

This is the pattern you need.

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<].*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>'

jawad-khan

score 2 · Answer 2 · answered Feb 17 '20 at 18:14

You can also try this:

import re

text = '''
          <SW-VARIABLE>
            <SHORT-NAME>abc</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>4</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
              cde
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>

          <SW-VARIABLE>
            <SHORT-NAME>def</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>8</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
                <HELLO>dsfadsf </HELLO>
                <NO>itis</NO>
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>
'''
pattern=r'<SW-VARIABLE>\s*<SHORT-NAME>([^<].*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?</SW-VARIABLE>'
print(re.findall(pattern, text, re.S))

Output:

[('abc', '4'), ('def', '8')]

Barka · Accepted Answer · 2020-02-17T18:16:06.417

I agree with the others: it is best to use an XML parser here. But to fix what you have...

You are missing a question mark. Regex quantifiers are greedy by default: they grab as much as they can. To make a quantifier non-greedy, add a question mark directly after it. This regex will give you what you want:

<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>

You had the question mark correctly after

</SW-ARRAYSIZE>.*

but you were missing it after

</SHORT-NAME>.*
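
To see the difference in isolation, here is a small sketch on a made-up string (the tag name a and the variable names are illustrative only, not taken from the question):

```python
import re

sample = "<a>1</a><a>2</a>"

# Greedy: .* runs to the end of the string, then backtracks to the LAST </a>.
greedy = re.findall(r"<a>(.*)</a>", sample)

# Non-greedy: .*? stops at the FIRST </a> it can reach.
lazy = re.findall(r"<a>(.*?)</a>", sample)

print(greedy)  # ['1</a><a>2']
print(lazy)    # ['1', '2']
```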

I think you only want to capture the contents of SHORT-NAME and VF. If that is the case, I would put them in named groups and retrieve the groups in code. The regex then becomes:

<SW-VARIABLE>\s*<SHORT-NAME>(?P<sn>[^<]*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>(?P<vf>[^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>

The two group names are sn and vf.

Your Python code for retrieving the named groups then becomes:

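# re.S lets '.' match newlines, which the multi-line input requires.
# re.search returns only the first match.
matches = re.search(regex, string1, re.S)
print("shortName: ", matches.group('sn'))
print("vf: ", matches.group('vf'))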
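
Since re.search stops at the first match, a sketch like the following could collect every SW-VARIABLE instead. It assumes re.finditer with the re.S flag, and it ends the pattern with the plain </SW-VARIABLE> used in another answer rather than the lookahead construct. The names regex and string1 follow the snippet above, and the sample data is shortened from the question.

```python
import re

regex = (
    r'<SW-VARIABLE>\s*<SHORT-NAME>(?P<sn>[^<]*?)</SHORT-NAME>'
    r'.*?<SW-ARRAYSIZE>\s*<VF>(?P<vf>[^<]*?)</VF>\s*</SW-ARRAYSIZE>'
    r'.*?</SW-VARIABLE>'
)

string1 = '''
<SW-VARIABLE>
  <SHORT-NAME>abc</SHORT-NAME>
  <CATEGORY>VALUE</CATEGORY>
  <SW-ARRAYSIZE>
    <VF>4</VF>
  </SW-ARRAYSIZE>
</SW-VARIABLE>
<SW-VARIABLE>
  <SHORT-NAME>def</SHORT-NAME>
  <CATEGORY>VALUE</CATEGORY>
  <SW-ARRAYSIZE>
    <VF>8</VF>
  </SW-ARRAYSIZE>
</SW-VARIABLE>
'''

# One (sn, vf) tuple per match; re.S is needed because each block spans lines.
results = [
    (m.group('sn'), m.group('vf'))
    for m in re.finditer(regex, string1, re.S)
]
print(results)  # [('abc', '4'), ('def', '8')]
```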

Thanks for the explanation. This does, however, not work for me. Have you tried it with the example? — mrCarnivore, Feb 17 '20 at 17:58
it looks like i had a typo in there. try this: https://regex101.com/r/JHGEek/2 — Barka, Feb 17 '20 at 18:15

score 0 · Answer 4 · answered Feb 17 '20 at 17:53

I seem to have found an answer myself:

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>\s*<CATEGORY>[^<]*</CATEGORY>\s*<SW-ARRAYSIZE>\s*<VF>(.*)</VF>\s*</SW-ARRAYSIZE>'

print(re.findall(pattern, text))

You really have to limit the usage of .* and make use of the very predictable structure of the XML.
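
One detail that seems to matter here: this call does not pass re.S, so . cannot cross a line break and the greedy (.*) stays on the line of its VF tag. A small sketch on made-up two-line input (variable names are illustrative) shows the effect of the flag:

```python
import re

two_lines = "<VF>4</VF>\n<VF>8</VF>"

# Without re.S, '.' does not match '\n', so each match stays on one line.
per_line = re.findall(r"<VF>(.*)</VF>", two_lines)

# With re.S, the greedy '.*' runs across the newline to the last </VF>.
across_lines = re.findall(r"<VF>(.*)</VF>", two_lines, re.S)

print(per_line)      # ['4', '8']
print(across_lines)  # ['4</VF>\n<VF>8']
```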
