Python's XML parser can parse byte strings in various encodings, even if no encoding is specified in the XML declaration:
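from xml.etree import ElementTree as ET
xml_string = '<doc>Glück</doc>'
# Encode the same document two ways; neither carries an XML declaration.
xml_utf_8 = xml_string.encode('utf-8')
xml_utf_16 = xml_string.encode('utf-16')
# Both byte strings parse without the encoding being stated anywhere.
print(ET.fromstring(xml_utf_8).text)
print(ET.fromstring(xml_utf_16).text)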
Output:
Glück
Glück
Questions:
- Is it safe to let the parser detect the encoding? (Detection works for UTF-8 vs. UTF-16; other encodings fail unless they are specified explicitly.)
- The detection seems to be done in the expat C library. How does it reliably detect the right encoding?
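To make the parenthetical in the first question concrete, here is a minimal sketch of the behaviour I mean. The helper `try_parse` and the variable names are only illustrative, and ISO-8859-1 is an arbitrary choice of "other" encoding. It also checks that Python's 'utf-16' codec prepends a byte order mark, which may be part of how the detection works.

```python
import codecs
from xml.etree import ElementTree as ET

xml_string = '<doc>Glück</doc>'

# Python's 'utf-16' codec prepends a byte order mark (BOM) to its output.
xml_utf_16 = xml_string.encode('utf-16')
has_bom = xml_utf_16.startswith(codecs.BOM_UTF16)

def try_parse(data, parser=None):
    """Hypothetical helper: return the root text, or None if parsing fails."""
    try:
        return ET.fromstring(data, parser=parser).text
    except ET.ParseError:
        return None

xml_latin_1 = xml_string.encode('iso-8859-1')

# No declaration and no parser hint: the bytes are not valid UTF-8.
undeclared = try_parse(xml_latin_1)

# Encoding named in the XML declaration.
declared = try_parse(b"<?xml version='1.0' encoding='iso-8859-1'?>" + xml_latin_1)

# Encoding passed to the parser instead of the document.
via_parser = try_parse(xml_latin_1, ET.XMLParser(encoding='iso-8859-1'))

print(has_bom, undeclared, declared, via_parser)
```

So the undeclared ISO-8859-1 input is rejected, while the same bytes parse once the encoding is named either in the declaration or through `XMLParser(encoding=...)`.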