I have a problem parsing invalid XML with lxml in Python 3. My current code is below (this is just a simplified example; in real life I have to read and process 100-300 MB XML files):
xml_str='''<r>
<n type="1" id="n1-1">
<p a="a" 6_x="x">text1</p>
</n>
<n type="2" id="n2-1">
<p a="a" 6_x="x">text2</p>
</n>
<n type="1" id="n1-2">
<p a="a" 6_x="x">text3</p>
</n>
</r>'''
import lxml.etree

xpath = '/r/n[@type="1"]/p/text()'
# recover=True should let the parser tolerate the invalid attribute names
parser = lxml.etree.XMLParser(recover=True)
tree = lxml.etree.fromstring(xml_str, parser)
r = tree.xpath(xpath)
print(r)
I get an empty result; I assume XMLParser with recover mode enabled just skips the invalid XML nodes.
My expected result is:
['text1', 'text3']
If I fix the XML (namely, rename the invalid attributes from 6_x to e.g. z6_x), everything works fine.
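Just to show what I mean: if I patch the example string and rename the attribute, the same code returns the expected list:

# rename the invalid attribute (6_x -> z6_x) directly in the example string
fixed_xml = xml_str.replace('6_x=', 'z6_x=')
tree = lxml.etree.fromstring(fixed_xml, parser)
print(tree.xpath(xpath))  # ['text1', 'text3']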
How could I preprocess the XML (probably using a custom XMLParser?) so that lxml is able to parse it? I suppose I should read the XML stream and rename the invalid attributes before passing the stream to lxml. Unfortunately, I'm not sure how to write such a custom parser (I don't have enough experience with that).
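The closest thing I can picture is lxml's feed-parser interface, fixing attribute names chunk by chunk before each feed() call, but I'm not sure this is the right direction (the file name, the regex and the 'z' prefix are just guesses, and an attribute name split across two chunks would slip through this naive version):

import re
import lxml.etree

# attribute names must not start with a digit, so prefix such names with 'z'
attr_re = re.compile(rb'(\s)(\d[\w.-]*=)')

parser = lxml.etree.XMLParser()
with open('big.xml', 'rb') as f:  # 'big.xml' stands in for the real 100-300 MB file
    for chunk in iter(lambda: f.read(64 * 1024), b''):
        # an attribute name split across two chunks would not be fixed here
        parser.feed(attr_re.sub(rb'\1z\2', chunk))
root = parser.close()
print(root.xpath('/r/n[@type="1"]/p/text()'))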
One option is to make two passes:
- read the file and fix the invalid attributes with a regex
- parse the corrected content with lxml
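Roughly like this (the file name 'big.xml' and the regex are just placeholders; the regex would also touch text content that happens to look like an attribute):

import re
import lxml.etree

# pass 1: read the whole file and prefix attribute names that start with a digit
with open('big.xml', 'rb') as f:
    data = f.read()
data = re.sub(rb'(\s)(\d[\w.-]*=)', rb'\1z\2', data)

# pass 2: parse the corrected bytes as usual
root = lxml.etree.fromstring(data)
print(root.xpath('/r/n[@type="1"]/p/text()'))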
But I'm curious whether there is a more efficient approach than that. Thanks.