Python lxml: Ignore XML declaration (errors)

Question

I am trying to parse the file browser Thunar's custom actions files (~/.config/Thunar/uca.xml) with the lxml Python module.

For some reason, Thunar writes a malformed XML declaration into these files:

<?xml encoding="UTF-8" version="1.0"?>

The version is expected to appear as the first "attribute" in the declaration, so lxml raises an XMLSyntaxError when I try to parse the file.

And no, I cannot simply correct the declaration, because Thunar keeps overwriting it with the bogus one.

This is very likely a bug in Thunar.

Nevertheless, I would like to know how to ignore the XML declaration with lxml.

I know that I could pre-process the XML document to filter out the XML declaration, but that doesn't seem very elegant. Since XML defaults to version 1.0 and UTF-8 encoding when no declaration is present, surely there is a way to make lxml ignore the declaration and assume those defaults. I didn't find anything in the documentation or on Google, but I might have overlooked something.

Can you edit your question and add the complete traceback? The order of attributes is not important. It is only important if the XML document should be well-formed, which is not the same thing as valid. — Burhan Khalid, Jun 04 '17 at 10:22

mzjn · Accepted Answer · 2017-06-05T15:23:26.517

I know very little about Thunar, but if it produces the XML declaration shown in the question, then that is a bug. An incorrect XML declaration means the document is not well-formed.

The XML grammar specifies one correct order for the items in the XML declaration. version must come first and encoding second. See http://w3.org/TR/xml/#NT-XMLDecl.
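To make the ordering rule concrete, here is a rough sketch of the XMLDecl production as a regular expression. It is a simplification for illustration only (for example, \s is looser than the spec's whitespace rule), and the names XML_DECL and is_well_formed_decl are made up for this example.

```python
import re

# Simplified rendering of the production from the XML spec:
#   XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
# version is mandatory and comes first; encoding and standalone are optional
# and may only appear in that order.
XML_DECL = re.compile(
    r"""<\?xml
        \s+version\s*=\s*(?:"1\.[0-9]+"|'1\.[0-9]+')
        (?:\s+encoding\s*=\s*(?:"[A-Za-z][A-Za-z0-9._-]*"|'[A-Za-z][A-Za-z0-9._-]*'))?
        (?:\s+standalone\s*=\s*(?:"(?:yes|no)"|'(?:yes|no)'))?
        \s*\?>""",
    re.VERBOSE,
)


def is_well_formed_decl(decl: str) -> bool:
    """Return True if decl matches the simplified XMLDecl pattern above."""
    return XML_DECL.fullmatch(decl) is not None


print(is_well_formed_decl('<?xml version="1.0" encoding="UTF-8"?>'))
print(is_well_formed_decl('<?xml encoding="UTF-8" version="1.0"?>'))
```

The second call uses the declaration from the question; it fails the pattern because encoding appears before version.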

However, with lxml you can parse using a parser instance that has the recover option set to True. It works in this case. The bad XML declaration is ignored.

from lxml import etree 

parser = etree.XMLParser(recover=True)
tree = etree.parse('uca.xml', parser)

See http://lxml.de/api/lxml.etree.XMLParser-class.html
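If you want to try this without touching the real file, the following sketch parses an in-memory document instead. The element names are a made-up stand-in for uca.xml (only the declaration is taken from the question), and the exact messages in error_log depend on the libxml2 version, so treat this as an illustration rather than a guaranteed result.

```python
from lxml import etree

# Hypothetical stand-in for uca.xml; only the declaration comes from the question.
data = (
    b'<?xml encoding="UTF-8" version="1.0"?>\n'
    b'<actions><action><name>Example</name></action></actions>'
)

# The default (strict) parser is expected to reject the declaration.
try:
    etree.fromstring(data)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# A recovering parser carries on past the bad declaration.
parser = etree.XMLParser(recover=True)
root = etree.fromstring(data, parser)

# The problems that were skipped over are still recorded on the parser.
for entry in parser.error_log:
    print(entry.line, entry.message)

print(strict_ok, root.tag)
```

Keep in mind that recover=True applies to the whole document, not only to the declaration, so other errors in the file would be silently repaired as well. Checking parser.error_log after parsing is one way to notice that.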
