12

I have an XML file and I need to fetch some of the tags from it. The data looks like this:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein1">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria1" direction="E"/>
        <neighbor name="Switzerland1" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia1" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

I need to parse this, so I used:

import xml.etree.ElementTree as ET
tree = ET.parse("myfile.xml")
root = tree.getroot()

This code gives an error at line 2: xml.etree.ElementTree.ParseError: junk after document element:

I think this is because the file contains multiple XML declarations and root elements. Do you have any idea how I should parse this?

ggupta
  • 1
    "I have a xml file..." No, you don't. Where does the file come from? Is there a possibility of fixing the issue on that side? (It shouldn't be too hard to parse it, but if there's any way to avoid the invalid XML in the first place, that would be better.) – user94559 Aug 03 '17 at 05:15
  • 1
    Together it is not a valid XML file. But you can split it before `<?xml` and parse the parts separately. – Klaus D. Aug 03 '17 at 05:16
  • @smarx what do you mean by `is there a possibility..` ? i have given only sample data from the file, it does contain many more root elements like this... @KlausD. searching for the better option. – ggupta Aug 03 '17 at 06:04
  • @ggupta I mean do you control the app that created that file, and can you fix it so it produces valid XML? – user94559 Aug 03 '17 at 12:02
  • @smarx, I don't control the app, I'm an end user. No solution seems to work for me. – ggupta Aug 03 '17 at 13:20
  • 1
    Then just split the file on the `<?xml` declarations. – user94559 Aug 03 '17 at 13:35

3 Answers

8

There's a simple trick I've used to parse such pseudo-XML (Wazuh rule files, for what it's worth): temporarily wrap it inside a fake element <whatever></whatever>, thus forming a single root over all these "roots".

In your case, rather than having an invalid XML like this:

<data> ... </data>
<data> ... </data>

Just before passing it to the parser, temporarily rewrite it as:

<whatever>
    <data> ... </data>
    <data> ... </data>
</whatever>

Then you parse it as usual and iterate over the <data> elements.

import xml.etree.ElementTree as etree
from pathlib import Path

file = Path('rules/0020-syslog_rules.xml')
# Wrap the multi-root content in a single fake root element
data = b'<rules>' + file.read_bytes() + b'</rules>'
root = etree.fromstring(data)
root.findall('group')
... array of Elements ...
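
The file in the question also repeats the <?xml version="1.0"?> declaration before each root, and a declaration that ends up inside the wrapper element would most likely still be rejected by the parser. A minimal sketch of the same wrapping trick that strips the declarations first; the helper name parse_multi_root, the wrapper tag and the sample string are made up for illustration:

```python
import re
import xml.etree.ElementTree as ET


def parse_multi_root(text, wrapper="whatever"):
    """Parse text holding several XML documents back to back.

    Hypothetical helper: removes every XML declaration, then wraps
    the remaining roots in a single fake root element.
    """
    body = re.sub(r"<\?xml[^>]*\?>", "", text)
    return ET.fromstring("<{0}>{1}</{0}>".format(wrapper, body))


# Shortened stand-in for the file from the question
sample = (
    '<?xml version="1.0"?>\n'
    '<data><country name="Liechtenstein"/></data>\n'
    '<?xml version="1.0"?>\n'
    '<data><country name="Liechtenstein1"/></data>\n'
)

root = parse_multi_root(sample)
names = [country.get("name") for country in root.iter("country")]
print(names)
```

This reads the whole file into memory, which should be fine for small files; for very large inputs a line-by-line split is probably the safer choice.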
kravietz
4

This code fills in details for one approach, if you want them.

The code accumulates lines in accumulated_xml until it encounters the beginning of another XML document or the end of the file. Once it has a complete XML document, it calls display, which uses the lxml library to parse the document and report some of its contents.

>>> from lxml import etree
>>> def display(alist):
...     tree = etree.fromstring(''.join(alist))
...     for country in tree.xpath('.//country'):
...         print(country.attrib['name'], country.find('rank').text, country.find('year').text)
...         print([neighbour.attrib['name'] for neighbour in country.xpath('neighbor')])
... 
>>> accumulated_xml = []
>>> with open('temp.xml') as temp:
...     while True:
...         line = temp.readline()
...         if line:
...             if line.startswith('<?xml'):
...                 if accumulated_xml:
...                     display (accumulated_xml)
...                     accumulated_xml = []
...             else:
...                 accumulated_xml.append(line.strip())
...         else:
...             display (accumulated_xml)
...             break
... 
Liechtenstein 1 2008
['Austria', 'Switzerland']
Singapore 4 2011
['Malaysia']
Panama 68 2011
['Costa Rica', 'Colombia']
Liechtenstein1 1 2008
['Austria1', 'Switzerland1']
Singapore 4 2011
['Malaysia1']
Panama 68 2011
['Costa Rica', 'Colombia']
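
The same accumulate-and-flush idea can also be written as a generator, which keeps the splitting separate from whatever is done with each document. This is only a sketch: it uses the standard library's ElementTree instead of lxml, assumes every XML declaration sits on its own line as in the sample file, and the name iter_documents and the in-memory sample are invented for the example:

```python
import io
import xml.etree.ElementTree as ET


def iter_documents(lines):
    """Yield the parsed root of each XML document in a concatenated stream."""
    chunk = []
    for line in lines:
        if line.lstrip().startswith("<?xml"):
            # A declaration marks the start of the next document
            if any(part.strip() for part in chunk):
                yield ET.fromstring("".join(chunk))
            chunk = []
        else:
            chunk.append(line)
    # Flush whatever is left at the end of the input
    if any(part.strip() for part in chunk):
        yield ET.fromstring("".join(chunk))


# Hypothetical in-memory stand-in for open('temp.xml')
sample = io.StringIO(
    '<?xml version="1.0"?>\n'
    '<data><country name="Liechtenstein"/></data>\n'
    '<?xml version="1.0"?>\n'
    '<data><country name="Liechtenstein1"/></data>\n'
)

names = [
    country.get("name")
    for root in iter_documents(sample)
    for country in root.iter("country")
]
print(names)
```

With a real file, passing the open file object (for example the one from open('temp.xml')) in place of the StringIO should work the same way, since both yield lines when iterated.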
Bill Bell
  • Thanks for this, I was just using the same approach. I wonder why there is no Python library for this. – ggupta Aug 04 '17 at 07:03
  • 1
    Whenever I use this way of splitting a file I think there must be a better way of expressing it in Python. – Bill Bell Aug 04 '17 at 14:38
3

Question: ... any idea, how should i parse this?

Filter the whole file and split it into valid chunks, one per <?xml ... declaration.
This creates myfile_01, myfile_02 ... myfile_nn.

n = 0
out_fh = None
with open('myfile.xml') as in_fh:
    for line in in_fh:
        # Each XML declaration starts a new output file
        if line.startswith('<?xml'):
            if out_fh:
                out_fh.close()
            n += 1
            out_fh = open('myfile_{:02}'.format(n), 'w')

        # Skip anything that precedes the first declaration
        if out_fh:
            out_fh.write(line)

if out_fh:
    out_fh.close()

If you want all <country> in one XML Tree:

import re
from xml.etree import ElementTree as ET

with open('myfile.xml') as fh:
    # Collect every <country> element and wrap them all in one <data> root
    countries = re.findall('<country.*?</country>', fh.read(), re.S)
    root = ET.fromstring('<?xml version="1.0"?><data>{}</data>'.format(''.join(countries)))

Tested with Python: 3.4.2

stovfl
  • Thanks for the suggestions, I used the same approach. – ggupta Aug 04 '17 at 07:05
  • I was just looking for a way to parse the file, not any specific tag. Your previous answer was helpful for me, thanks for modifying it. – ggupta Aug 04 '17 at 12:07