0

I am still struggling with regexp:

import re

text = '''
          <SW-VARIABLE>
            <SHORT-NAME>abc</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>4</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
              cde
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>

          <SW-VARIABLE>
            <SHORT-NAME>def</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>8</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
                <HELLO>dsfadsf </HELLO>
                <NO>itis</NO>
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>
'''

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>'

print(re.findall(pattern, text, re.S))

This returns:

[('abc', '8')]

I would expect it to return:

[('abc', '4'), ('def', '8')]

Why is it so greedy and matches everything until the last closing tag?

This is the regex101 link: https://regex101.com/r/ANO7RA/1

Maybe negative lookahead will solve this. I was not able to fully grasp the concept, though... :-(

mrCarnivore
  • 4,638
  • 2
  • 12
  • 29
  • 4
    I _strongly_ urge you to use a proper XML parser. See [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/a/1732454/354577). While it may be possible to handle specific narrow use cases with regular expressions in general it is **_literally not possible_** to parse XML with regex. It's almost always better to use a proper XML / HTML parser like [`lxml`](https://lxml.de/) or an XML query language like [XPath](https://en.wikipedia.org/wiki/XPath). – ChrisGPT was on strike Feb 17 '20 at 17:32
  • 1
    See also [How do I parse XML in Python?](https://stackoverflow.com/q/1912434/354577) – ChrisGPT was on strike Feb 17 '20 at 17:34
  • I second what @Chris said. I don't know a single person that favors xml instead of json but a few of them tried to use regex. It only generates more problems. Recently I've found [xmltodict](https://github.com/martinblech/xmltodict) and it's super easy to use (I don't like`lxml` either). – Tom Wojcik Feb 17 '20 at 18:00
  • 1
    I also do not favor XML instead of JSON, I do need to make due with the format my source information comes in, though. – mrCarnivore Feb 17 '20 at 18:01

4 Answers4

2

This is the pattern you need.

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<].*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>'

jawad-khan
  • 313
  • 1
  • 10
2

you can also check this out :

import re

text = '''
          <SW-VARIABLE>
            <SHORT-NAME>abc</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>4</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
              cde
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>

          <SW-VARIABLE>
            <SHORT-NAME>def</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>8</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
                <HELLO>dsfadsf </HELLO>
                <NO>itis</NO>
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>
'''
pattern=r'<SW-VARIABLE>\s*<SHORT-NAME>([^<].*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?</SW-VARIABLE>'
print(re.findall(pattern, text, re.S))

output :

[('abc', '4'), ('def', '8')]
Freeman
  • 9,464
  • 7
  • 35
  • 58
1

I agree with others, it is best to use an xml parser here. But to fix what you have ...

You are missing a question mark. regexes are greedy by default. They grab as much as they can. To make them non-greedy, you need to add a question mark after the part that you want to be none-greedy for. This regex will give you what you want:

<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>

you had the question mark correctly after

</SW-ARRAYSIZE>.*

but you were missing it after

</SHORT-NAME>.*

.

I think you want to only capture the content of the two '.*?'s. If that is the case, I would put them in groups and retrieve the groups in code to work with them. The regex will then become:

<SW-VARIABLE>\s*<SHORT-NAME>(?P<sn>[^<]*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>(?P<vf>[^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>

with the two group names being sn and vf. demo

Your python code for retrieving the named groups will then become:

matches= re.search(regex, string1)
print("shortName: ", matches.group('sn'))
print("vf: ", matches.group('vf'))
Barka
  • 8,764
  • 15
  • 64
  • 91
0

I seem to have found an answer myself:

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>\s*<CATEGORY>[^<]*</CATEGORY>\s*<SW-ARRAYSIZE>\s*<VF>(.*)</VF>\s*</SW-ARRAYSIZE>'

print(re.findall(pattern, text))

You really have to limit the usage of .* and make use of the very predictable structure of the XML.

mrCarnivore
  • 4,638
  • 2
  • 12
  • 29