A good solution, even if I don't understand why it works
To make a long story short, use utf-8 instead of utf-8-sig, and it works even if a UTF-8 encoded BOM is present:
>>> import io, lxml.etree
>>> data = b'\xef\xbb\xbf<test/>'
>>> lxml.etree.parse(io.BytesIO(data), parser=lxml.etree.XMLParser(encoding='utf-8'))
<lxml.etree._ElementTree object at 0x7f3403e47730>
Note that it has to be utf-8, and not utf8, even though the latter is commonly accepted as an alias by Python:
>>> lxml.etree.parse(io.BytesIO(b'\xef\xbb\xbf<test/>'), parser=lxml.etree.XMLParser(encoding='utf8', remove_blank_text=True))
Traceback (most recent call last):
...
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
Background info
lxml is a wrapper around the libxml2 library. For that reason, the encoding argument passed to XMLParser is not the name of a Python encoding, but rather an iconv encoding name. I had to dive into the lxml source to figure this out, and could confirm it by checking with e.g. OSF00010004, which is supported by iconv on my system but not by Python:
>>> lxml.etree.parse(io.BytesIO(b'<test/>'), parser=lxml.etree.XMLParser(encoding='OSF00010004'))
<lxml.etree._ElementTree object at 0x7f8baa6adc30>
>>> b'<test/>'.decode('OSF00010004')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
LookupError: unknown encoding: OSF00010004
We can list the supported encodings using iconv -l, but there's no equivalent of Python's BOM-stripping utf-8-sig. Apparently passing utf-8 is good enough.
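If you want to check programmatically what your local iconv knows, here's a quick sketch. The iconv_supports helper is my own name, not a standard API, and it assumes a GNU-style iconv on the PATH whose -l listing prints names like UTF-8//:

import subprocess

def iconv_supports(name):
    # Ask the local iconv binary for its list of known encodings;
    # strip the trailing slashes that glibc appends to each name.
    listing = subprocess.run(['iconv', '-l'],
                             capture_output=True, text=True, check=True).stdout
    known = {token.rstrip('/').upper() for token in listing.split()}
    return name.upper() in known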
It's worth knowing that libxml2 works exclusively on UTF-8 encoded strings, as we can learn from the lxml FAQ:
The text encoding that libxml2 uses internally is UTF-8, so parsing from a Unicode file means that Python first reads a chunk of data from the file, then decodes it into a new buffer, and then copies it into a new unicode string object, just to let libxml2 make yet another copy while encoding it down into UTF-8 in order to parse it.
This has performance implications, as the FAQ entry details.
A simple workaround that I do understand
We can first decode and then parse:
from lxml import etree

response_string = response_bytes_io.read().decode('utf-8-sig')
xml = etree.fromstring(response_string)
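The utf-8-sig codec is what takes care of the BOM here; it also happily decodes input that has no BOM at all:
>>> b'\xef\xbb\xbf<test/>'.decode('utf-8-sig')
'<test/>'
>>> b'<test/>'.decode('utf-8-sig')
'<test/>'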
As noted above, this is less efficient: Python strings are not stored internally as UTF-8, so the text has to be re-encoded to UTF-8 before libxml2 can use it.
You also need to be aware that this approach will fail if the XML contains an encoding declaration like <?xml version="1.0" encoding="UTF-8"?>.
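For example, feeding such a document to etree.fromstring as a string raises:
>>> etree.fromstring('<?xml version="1.0" encoding="UTF-8"?><test/>')
Traceback (most recent call last):
...
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.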
That can be a deal breaker if you're dealing with XML from third-party sources.
A better workaround that I do understand
We can also strip off the UTF-8 encoded BOM ourselves, because it's always the three bytes \xef\xbb\xbf.
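Python even exposes this byte sequence as a constant in the codecs module, which makes the intent explicit:
>>> import codecs
>>> codecs.BOM_UTF8
b'\xef\xbb\xbf'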
Sadly, doing this on a file-like object is a bit more involved than on a string, because you can't read ahead. Wrapping the file in an io.BufferedReader gives you a peek() method, but you can't control how many bytes it returns.
So the safe approach is to first read everything into a buffer:
import io
from lxml import etree

response_bytes = response_bytes_io.read()
# Drop the UTF-8 BOM, if present, before handing the bytes to lxml.
if response_bytes.startswith(b'\xef\xbb\xbf'):
    response_bytes = response_bytes[3:]
parser = etree.XMLParser(encoding='utf-8')
xml = etree.parse(source=io.BytesIO(response_bytes), parser=parser)
This is less efficient than operating on the stream directly, because parsing is delayed until the entire response has been read, but it's still more efficient than having an extra decoding and re-encoding pass.
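If you'd rather not buffer the whole response, you can get the streaming behaviour back with a small wrapper that peeks at the first three bytes and hands them out again when they turn out not to be a BOM. This is only a sketch under some assumptions: BOMStrippingReader is a made-up name, not part of lxml, and the first read(3) is assumed to return all three bytes unless the stream is shorter than that (true for regular files and io.BytesIO, but not guaranteed for every stream):

import io
from lxml import etree

class BOMStrippingReader:
    # Minimal file-like wrapper (hypothetical): reads the first three
    # bytes up front and withholds them only if they are a UTF-8 BOM.
    def __init__(self, raw):
        head = raw.read(3)
        self._pending = b'' if head == b'\xef\xbb\xbf' else head
        self._raw = raw

    def read(self, size=-1):
        if size is None or size < 0:
            data = self._pending + self._raw.read()
            self._pending = b''
            return data
        if self._pending:
            # Serve the held-back bytes first; a short read is fine,
            # since the parser keeps calling read() until it gets b''.
            data, self._pending = self._pending[:size], self._pending[size:]
            return data
        return self._raw.read(size)

parser = etree.XMLParser(encoding='utf-8')
xml = etree.parse(BOMStrippingReader(response_bytes_io), parser=parser)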