Why is XML parsing so difficult?

Question

I am trying to parse this simple document received from EPO-OPS.

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
    <ops:meta name="elapsed-time" value="2"/>
    <exchange-documents>
        <exchange-document system="ops.epo.org" family-id="19768124" country="EP" doc-number="1000000" kind="A1">
            <abstract lang="en">
                <p>The invention relates to an apparatus (1) for manufacturing green bricks from clay for the brick manufacturing industry, comprising a circulating conveyor (3) carrying mould containers combined to mould container parts (4), a reservoir (5) for clay arranged above the mould containers, means for carrying clay out of the reservoir (5) into the mould containers, means (9) for pressing and trimming clay in the mould containers, means (11) for supplying and placing take-off plates for the green bricks (13) and means for discharging green bricks released from the mould containers, characterized in that the apparatus further comprises means (22) for moving the mould container parts (4) filled with green bricks such that a protruding edge is formed on at least one side of the green bricks. &lt;IMAGE></p>
            </abstract>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>

This is what I am doing:

import xml.etree.ElementTree as ET
root = ET.parse('pyre.xml').getroot()
for child in root:
    for kid in child:
        for abst in kid:
            for p in abst:
                print(p.text)

Is there a simple way to do this, similar to accessing JSON, such as:

print (root.exchange-documents.exchange-document.abstract.p.text)

score 2 · Accepted Answer · answered Jul 14 '16 at 11:37

It is much easier with BeautifulSoup. Try this:

from bs4 import BeautifulSoup

xml = """<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
    <ops:meta name="elapsed-time" value="2"/>
    <exchange-documents>
        <exchange-document system="ops.epo.org" family-id="19768124" country="EP" doc-number="1000000" kind="A1">
            <abstract lang="en">
                <p>The invention relates to an apparatus (1) for manufacturing green bricks from clay for the brick manufacturing industry, comprising a circulating conveyor (3) carrying mould containers combined to mould container parts (4), a reservoir (5) for clay arranged above the mould containers, means for carrying clay out of the reservoir (5) into the mould containers, means (9) for pressing and trimming clay in the mould containers, means (11) for supplying and placing take-off plates for the green bricks (13) and means for discharging green bricks released from the mould containers, characterized in that the apparatus further comprises means (22) for moving the mould container parts (4) filled with green bricks such that a protruding edge is formed on at least one side of the green bricks. &lt;IMAGE></p>
            </abstract>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>"""

"Long" solution:

soup = BeautifulSoup(xml, 'xml')  # name the parser explicitly; 'xml' requires lxml to be installed
for sub_cell_tag in soup.find_all('abstract'):
    print(sub_cell_tag.text)

If you are into one-liners:

print('\n'.join(i.text for i in BeautifulSoup(xml, 'xml').find_all('abstract')))

Yes, you can find its documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ — Gábor Erdős, Jul 14 '16 at 11:44

score 2 · Answer 2 · answered Jul 14 '16 at 11:40

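You can use XPath expressions with ElementTree (it supports a limited subset of XPath). Note that because the document declares a default namespace with xmlns, the unprefixed elements all belong to that namespace, so you need to map a prefix to its URL and use that prefix in the path: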

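from xml.etree import ElementTree

tree = ElementTree.parse(…)

# Map a prefix of your choice to the document's default namespace URL.
namespaces = { 'ns': 'http://www.epo.org/exchange' }
paragraphs = tree.findall('.//ns:abstract/ns:p', namespaces)
for paragraph in paragraphs:
    print(paragraph.text)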
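
For reference, here is a minimal self-contained sketch of the same idea that gets close to the dotted access asked for in the question. It assumes a shortened stand-in for the EPO-OPS response (the `SAMPLE` string and its trimmed abstract text are illustrative, not the real payload) and uses only the standard library:

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the EPO-OPS response from the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
    <ops:meta name="elapsed-time" value="2"/>
    <exchange-documents>
        <exchange-document country="EP" doc-number="1000000" kind="A1">
            <abstract lang="en">
                <p>The invention relates to an apparatus (1) for manufacturing green bricks.</p>
            </abstract>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>"""

root = ET.fromstring(SAMPLE)

# The prefix name 'ns' is arbitrary; only the URL has to match the document.
NS = {'ns': 'http://www.epo.org/exchange'}

# One path from the root replaces the four nested loops.
path = 'ns:exchange-documents/ns:exchange-document/ns:abstract/ns:p'
abstract_text = root.findtext(path, namespaces=NS)
print(abstract_text)

# Attributes are not namespaced in this document, so plain names work.
doc = root.find('ns:exchange-documents/ns:exchange-document', NS)
print(doc.get('country'), doc.get('doc-number'), doc.get('kind'))
```

`findtext` returns the text of the first match (or `None` if nothing matches), which is usually enough when a document contains a single abstract.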

answered by poke

Can't we get rid of namespace by using getroot()? – Rahul Jul 14 '16 at 12:11
No, ElementTree has namespaces built into its core and will (correctly) respect those all the time. You could remove the namespaces after parsing as [discussed in this answer](http://stackoverflow.com/a/25920989/216074), but there’s no built-in solution to just ignore them. – poke Jul 14 '16 at 14:12
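
As a rough sketch of what stripping namespaces after parsing can look like (not necessarily the approach in the linked answer): the helper name `strip_namespaces` and the shortened `SAMPLE` document are hypothetical. The `{*}` wildcard shown first is available in ElementTree from Python 3.8 onwards and avoids modifying the tree at all.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the EPO-OPS response from the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
    <exchange-documents>
        <exchange-document country="EP" doc-number="1000000" kind="A1">
            <abstract lang="en">
                <p>The invention relates to an apparatus (1) for manufacturing green bricks.</p>
            </abstract>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>"""

root = ET.fromstring(SAMPLE)

# Option 1 (Python 3.8+): '{*}' matches the tag name in any namespace.
wildcard_texts = [p.text for p in root.findall('.//{*}abstract/{*}p')]


def strip_namespaces(element):
    """Remove the '{uri}' prefix from every tag in the tree, in place."""
    for node in element.iter():
        # Comments and processing instructions have a non-string tag.
        if isinstance(node.tag, str) and node.tag.startswith('{'):
            node.tag = node.tag.split('}', 1)[1]
    return element


# Option 2: rewrite the tags once, then use plain paths everywhere.
plain_root = strip_namespaces(root)
stripped_texts = [p.text for p in plain_root.findall('.//abstract/p')]

print(wildcard_texts)
print(stripped_texts)
```

Stripping is lossy: elements from different namespaces that share a local name become indistinguishable, and this sketch leaves namespaced attribute names untouched.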
