Python's XML parser (xml.etree.ElementTree) can parse byte strings in different encodings, even when the XML header specifies no encoding:

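from xml.etree import ElementTree as ET

xml_string = '<doc>Glück</doc>'  # no XML declaration, so no declared encoding

xml_utf_8 = xml_string.encode('utf-8')
xml_utf_16 = xml_string.encode('utf-16')  # this codec prepends a byte order mark

# Parse both byte strings without telling the parser which encoding they use
print(ET.fromstring(xml_utf_8).text)
print(ET.fromstring(xml_utf_16).text)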

Output:

Glück
Glück

Questions:

  • Is it safe to let the parser detect the encoding itself? This only seems to work for UTF-8 and UTF-16; other encodings fail unless they are specified in the parser.
  • The detection seems to be done in the expat C library. How does it reliably detect the right encoding?
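
For comparison, here is a minimal sketch of the failing case mentioned in the first question, using ISO-8859-1 as an assumed example of an "other" encoding. It relies on the `encoding` keyword of `ET.XMLParser`, which overrides whatever the document declares. The variable names are illustrative only.

```python
from xml.etree import ElementTree as ET

xml_string = '<doc>Glück</doc>'
xml_latin_1 = xml_string.encode('iso-8859-1')  # 'ü' becomes the single byte 0xFC

# Without a hint the bytes are treated as UTF-8, where a lone 0xFC is invalid.
try:
    ET.fromstring(xml_latin_1)
    detected = True
except ET.ParseError:
    detected = False
print(detected)

# Telling the parser the encoding up front lets it decode the same bytes.
parser = ET.XMLParser(encoding='iso-8859-1')
text_with_hint = ET.fromstring(xml_latin_1, parser=parser).text
print(text_with_hint)
```

So for anything other than UTF-8 or UTF-16, the encoding apparently has to come from the XML declaration or from the parser argument.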
aalbagarcia
Steve
  • see https://stackoverflow.com/questions/12349728/elementtree-and-unicode – balderman Dec 15 '20 at 07:19
  • According to this answer, it should default to UTF-8 whenever no encoding is specified in the XML header (which would be the correct behavior according to the XML specification). But as the example above shows, the parser detects that the second byte string is UTF-16 and does not fall back to UTF-8. – Steve Dec 15 '20 at 07:33
  • The Expat documentation has [a whole chapter explaining this mechanism](https://libexpat.github.io/doc/expat-internals-encodings/), though understanding it requires a fair amount of knowledge about the library's internals. – tripleee Dec 15 '20 at 08:26
  • I guess that the parser detects the byte order mark in the case of UTF-16. – mzjn Dec 15 '20 at 08:27
  • Perhaps see also https://stackoverflow.com/a/377306/874188 – tripleee Dec 15 '20 at 08:28
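
The byte order mark guess from the comments can be checked with a short sketch. It assumes only the standard `codecs` constants and the same test string as in the question. The last step encodes with 'utf-16-le', which writes no BOM, to see whether the parser still copes. That result is printed rather than assumed here.

```python
import codecs
from xml.etree import ElementTree as ET

xml_string = '<doc>Glück</doc>'
xml_utf_8 = xml_string.encode('utf-8')
xml_utf_16 = xml_string.encode('utf-16')

# Python's generic 'utf-16' codec writes a BOM in the platform's byte order.
utf_16_boms = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
has_utf_16_bom = xml_utf_16[:2] in utf_16_boms
has_utf_8_bom = xml_utf_8.startswith(codecs.BOM_UTF8)
print(has_utf_16_bom, has_utf_8_bom)

# 'utf-16-le' writes no BOM, so this shows whether the BOM is actually required.
xml_utf_16_no_bom = xml_string.encode('utf-16-le')
try:
    result = ET.fromstring(xml_utf_16_no_bom).text
except ET.ParseError as exc:
    result = f'ParseError: {exc}'
print(result)
```

If the BOM-less variant also parses, the detection must rely on more than the BOM, for example on the zero bytes that UTF-16 puts around the leading '<'.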
