Parse invalid xml with lxml Python 3

Question

I have a problems parsing invalid XML with lxml in Python3. My current code is (this is just an example for the sake of simplicity. In a real life I have to read and process 100-300MB XML files):

xml_str='''<r>
    <n type="1" id="n1-1">
        <p a="a" 6_x="x">text1</p>
    </n>
    <n type="2" id="n2-1">
        <p a="a" 6_x="x">text2</p>
    </n>
    <n type="1" id="n1-2">
        <p a="a" 6_x="x">text3</p>
    </n>
</r>'''

import lxml.etree
xpath='/r/n[@type="1"]/p/text()'
parser = lxml.etree.XMLParser(recover=True)
tree = lxml.etree.fromstring(xml_str, parser)
r = tree.xpath(xpath)
print(r)

I get the empty result, assuming XMLParser with enabled recover mode just skips invalid xml nodes. My expected result is:

['text1', 'text3']

If I fix XML (namely: rename invalid attributes from 6_x to f.e. z6_x) everything is ok. How could I preprocess XML (probably using custom XMLParser?) to make able to parse XML with lxml? I suppose I should read xml stream and rename invalid attributes before sending this stream to a lxml. Unfortunately, I have no idea how to write this custom parser (have not enough experience for that). One option is to make two passes:

Read file and replace attributes with regex
Parse this corrected file to a lxml

But I'm curious if there is a more efficient approach to do this. Thanks.

Before developing an end use solution you should check *why* you receive invalid XML. By definition, XML is well-formed and valid. Check the data source/software/script as possibly it builds these file(s) treating them as text files with no markup rules and not with W3C-compliant DOM libraries. — Parfait, Feb 18 '17 at 13:36
@Parfait Unfortunately, invalid XML is an inbound restriction that I cannot change. — lospejos, Feb 18 '17 at 13:59
You cannot discuss with vendor or IT unit? The W3C 1.0 specs has been in place for over a decade. How software and programs not have yet adhered is astounding! I would use Python to read and replace lines iteratively before `lxml` processing. It's a golden rule not to [regex X/HTML](http://stackoverflow.com/a/1732454/1422451) documents. — Parfait, Feb 18 '17 at 14:22
Don't call it "invalid XML" because that suggests it's XML. Call it something else. Then create a specification for what it is - a grammar. Then write a parser for that grammar. Yes this is hard work. But that's why people invest in standards, so if you adhere to them, you save lots of effort (and money). If you want to exchange data using proprietary non-standard formats, it's going to cost you. — Michael Kay, Feb 18 '17 at 17:33

Parse invalid xml with lxml Python 3

0 Answers0