
I am trying to parse an XML file with a regular expression. For every script tag that contains a line with the alias "catch", I need to collect the tag's "type" and the line's "value".

<script type="abc">
    <line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
    <line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>

I tried this regular expression with multiline and dotall:

>>> re.findall(r'script\s+type=\"(\w+)\".*alias=\"catch\"\s+value=\"(\d+)\"', a, re.MULTILINE|re.DOTALL)

The output I am getting is:

[('abc', '8')]

The expected output is:

[('abc', '4'), ('xyz', '8')]

Can someone help me figure out what I am missing here?

npatel
  • Don't parse XML with regex. [See here](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) for an explanation why. – Nearoo Oct 03 '18 at 18:30
  • I didn't quite follow why I shouldn't be using regex in this case. – npatel Oct 03 '18 at 18:36

2 Answers


I recommend using BeautifulSoup. You can parse through the tags and, with a little bit of data re-structuring, easily check for the right alias values and store the related attributes of interest. Like this:

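from bs4 import BeautifulSoup

# `data` is the XML string listed under "Data:" below.
soup = BeautifulSoup(data, "lxml")

to_keep = []
for script in soup.find_all("script"):
    t = script["type"]
    # The HTML parser keeps the body of <script> as raw text, so split it
    # on whitespace and turn each key="value" token into a dict entry.
    attrs = dict(
        attr.split("=", 1)
        for attr in script.contents[0].split()
        if "=" in attr
    )
    # Values keep their surrounding double quotes, hence '"catch"'.
    if attrs.get("alias") == '"catch"':
        to_keep.append({"type": t, "value": attrs["value"]})

to_keep
# [{'type': 'abc', 'value': '"4"'}, {'type': 'xyz', 'value': '"8"'}]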

Data:

data = """<script type="abc">
    <line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
    <line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>"""
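
For comparison, here is a minimal sketch of the same lookup with the standard library's xml.etree.ElementTree instead of BeautifulSoup. This is not the approach shown above, just an alternative under one assumption: the sample has two top-level script elements, so it is wrapped in a hypothetical root element (named `root` here purely for illustration) to make it well-formed XML.

```python
import xml.etree.ElementTree as ET

# Same sample as in the question.
data = """<script type="abc">
    <line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
    <line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>"""

# Assumption: wrap the fragments in a made-up <root> element so the
# parser sees a single document element.
root = ET.fromstring("<root>" + data + "</root>")

results = []
for script in root.iter("script"):
    # ElementTree's XPath subset supports attribute predicates.
    for line in script.findall("line[@alias='catch']"):
        results.append((script.get("type"), line.get("value")))

print(results)  # [('abc', '4'), ('xyz', '8')]
```

Because the parser tracks which line belongs to which script element, a script without a "catch" line simply contributes nothing, and attribute values come back without their surrounding quotes.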
andrew_reece

Found the answer. Thanks downvoter. I don't think there was any need to downvote this question.

>>> re.findall(r'script\s+type=\"(\w+)\".*?alias=\"catch\"\s+value=\"(\d+)\".*?\<\/script\>', a, re.MULTILINE|re.DOTALL)
[('abc', '4'), ('xyz', '8')]
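
To spell out why this works: with re.DOTALL, the greedy `.*` in the original pattern runs to the end of the input and backtracks only as far as the last `alias="catch"`, so the first "type" gets paired with the last "value". The lazy `.*?` stops at the first possible match instead. The sketch below compares the two on the sample from the question; re.MULTILINE is left out because neither pattern uses `^` or `$`. The `tricky` input is a made-up example, not from the question, showing a case where the lazy pattern still pairs the wrong tag.

```python
import re

a = """<script type="abc">
    <line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
    <line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>"""

# Pattern from the question (greedy .*) and from this answer (lazy .*?).
greedy = r'script\s+type="(\w+)".*alias="catch"\s+value="(\d+)"'
lazy = r'script\s+type="(\w+)".*?alias="catch"\s+value="(\d+)".*?</script>'

greedy_result = re.findall(greedy, a, re.DOTALL)
lazy_result = re.findall(lazy, a, re.DOTALL)
print(greedy_result)  # [('abc', '8')]
print(lazy_result)    # [('abc', '4'), ('xyz', '8')]

# Hypothetical input: the first script has no "catch" line, so the lazy
# match runs past its closing tag and borrows the value from the second.
tricky = """<script type="abc">
    <line alias="other" value="1"/>
</script>
<script type="xyz">
    <line alias="catch" value="8"/>
</script>"""

tricky_result = re.findall(lazy, tricky, re.DOTALL)
print(tricky_result)  # [('abc', '8')]
```

So the lazy version is fine as long as every script tag is known to contain a "catch" line; otherwise a real XML parser is the safer choice.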
npatel