How does Python's XML parser detect the encoding (UTF-8 vs. UTF-16)?

Asked Dec 15 '20 at 07:07

Active Dec 15 '20 at 08:04

The Python XML parser can parse byte strings in different encodings, even when no encoding is specified in the XML declaration:

from xml.etree import ElementTree as ET

xml_string = '<doc>Glück</doc>'

xml_utf_8 = xml_string.encode('utf-8')
xml_utf_16 = xml_string.encode('utf-16')

print(ET.fromstring(xml_utf_8).text)
print(ET.fromstring(xml_utf_16).text)

Output:

Glück
Glück

Questions:

Is it safe to let the parser detect the encoding itself? This only seems to work for UTF-8 and UTF-16; other encodings fail unless they are specified explicitly.
The detection seems to happen in the expat C library. How does it reliably determine the right encoding?
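
To make the "other encodings fail if not specified" part of the first question concrete, here is a minimal sketch using ISO-8859-1 as the example encoding (my choice, not something from the question). It shows two ways of telling the parser the encoding: an XML declaration in the document, and the `encoding` argument of `ET.XMLParser`. The variable names are only for illustration.

```python
from xml.etree import ElementTree as ET

xml_string = '<doc>Glück</doc>'

# ISO-8859-1 bytes with no declaration: the byte 0xFC followed by 'c'
# is not valid UTF-8, so the parser is expected to reject the input.
undeclared = xml_string.encode('iso-8859-1')
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Option 1: name the encoding in the XML declaration.
declared = ('<?xml version="1.0" encoding="iso-8859-1"?>'
            + xml_string).encode('iso-8859-1')
text_declared = ET.fromstring(declared).text

# Option 2: override the encoding on the parser object.
parser = ET.XMLParser(encoding='iso-8859-1')
text_override = ET.fromstring(undeclared, parser=parser).text

print(undeclared_ok, text_declared, text_override)
```

Both options rely on the caller already knowing the encoding; neither is auto-detection.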

edited Dec 15 '20 at 08:04

aalbagarcia

asked Dec 15 '20 at 07:07

Steve

see https://stackoverflow.com/questions/12349728/elementtree-and-unicode – balderman Dec 15 '20 at 07:19
According to this answer, the parser should default to UTF-8 whenever no encoding is specified in the XML declaration (which would be the correct behavior according to the XML specification). But as the example above shows, it detects that the second byte string is UTF-16 and does not fall back to UTF-8. – Steve Dec 15 '20 at 07:33
The Expat documentation has [a whole chapter explaining this mechanism](https://libexpat.github.io/doc/expat-internals-encodings/), though understanding it requires a fair amount of knowledge about the library's internals. – tripleee Dec 15 '20 at 08:26
I guess that the parser detects the byte order mark in the case of UTF-16. – mzjn Dec 15 '20 at 08:27
Perhaps see also https://stackoverflow.com/a/377306/874188 – tripleee Dec 15 '20 at 08:28
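
The byte order mark guess in the comments above can be checked from the Python side. The sketch below only inspects the bytes produced by the codecs; `sniff_bom` is a hypothetical helper written for this illustration, not a function of expat or `xml.etree`, and it is not a description of expat's actual detection logic.

```python
import codecs

xml_string = '<doc>Glück</doc>'

def sniff_bom(data: bytes):
    """Hypothetical helper: return a codec name if data starts with a
    UTF-8 or UTF-16 byte order mark, otherwise None."""
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    return None

# The plain 'utf-16' codec prepends a BOM when encoding.
print(sniff_bom(xml_string.encode('utf-16')))
# The endian-specific codec and plain 'utf-8' do not write a BOM.
print(sniff_bom(xml_string.encode('utf-16-le')))
print(sniff_bom(xml_string.encode('utf-8')))
```

So the UTF-16 byte string in the question carries a BOM that a parser could use, while the UTF-8 one does not and would have to be handled by some other rule, such as a UTF-8 default.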
