I'm not sure whether my problem is with my programming or Apple's iTunes library export, but I'm going to start by assuming it's my programming.
I'm trying to parse the XML library export from iTunes. The fragment that does the parsing is simply:
import xml.etree.ElementTree as ET

def parse_tree(source):
    # iterparse yields (event, element) pairs as the file is read
    parser = ET.iterparse(source)
    for action, elem in parser:
        # ...
This fails on the file iTunes gives me when I export my library, with the error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 886: character maps to <undefined>
The XML header is <?xml version="1.0" encoding="UTF-8"?>
and the offending fragment of XML appears to be:
<key>Name</key><string>Part 2. The Death Of Enkidu. Skon Přitele Mého Mne Zdeptal Težče</string>
This is rendered just fine by iTunes, the oXygen XML editor and, I see, by Stack Overflow. But the HxD hex editor does show a 0x8D in there, and flags it as undefined in UTF-8. The relevant bit of hex seems to be:
6C 20 54 65 C5 BE C4 8D 65 3C 2F 73 74 72 69 6E
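As a quick sanity check on those bytes, here is a minimal sketch that decodes them two ways. It assumes the hex dump above is accurate, and it assumes cp1252 is the code page behind the 'charmap' codec in the traceback, which the error message itself does not confirm.

```python
# Bytes copied from the hex dump above.
raw = bytes.fromhex("6C 20 54 65 C5 BE C4 8D 65 3C 2F 73 74 72 69 6E")

# Decode as UTF-8, the encoding named in the XML header.
as_utf8 = raw.decode("utf-8")

# Decode as cp1252 (assumed to be the Windows default in play here).
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = str(exc)

print(as_utf8)
print(cp1252_error)
```

If the UTF-8 decode succeeds and only the cp1252 decode complains about 0x8d, that would point at the bytes being read with the wrong codec rather than at the file itself.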
So is this iTunes not exporting valid UTF-8, Python's ElementTree not handling it correctly, or me doing something wrong? And how do I get it to read the rest of the element, skipping the character it can't read or replacing it with a default character such as a question mark?
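Two things that might be worth trying are sketched below. This is not a confirmed fix: it assumes the error comes from the file being opened in text mode with the platform default encoding, which the fragment above does not show. The helper name `string_values` and the in-memory sample documents are hypothetical stand-ins for the real export.

```python
import io
import xml.etree.ElementTree as ET


def string_values(source):
    """Yield the text of each <string> element read from a file-like object."""
    for _event, elem in ET.iterparse(source):
        if elem.tag == "string":
            yield elem.text


# Stand-in for the exported library; a real file would be opened with open(path, "rb")
# so the parser sees raw bytes and can honour the encoding in the XML header.
good = ('<?xml version="1.0" encoding="UTF-8"?>'
        '<dict><key>Name</key><string>Težče</string></dict>').encode("utf-8")
strict_result = list(string_values(io.BytesIO(good)))

# Same document with one genuinely invalid byte (0xFF), decoded leniently so the
# bad byte becomes U+FFFD instead of raising.
bad = good.replace("ž".encode("utf-8"), b"\xff")
lenient = io.TextIOWrapper(io.BytesIO(bad), encoding="utf-8", errors="replace")
lenient_result = list(string_values(lenient))
```

For a file on disk, the lenient variant would correspond to `open(path, encoding="utf-8", errors="replace")`. The replacement character is U+FFFD rather than a question mark; swapping it afterwards with `str.replace` is one option.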
Edit: The OS is Windows 10. The file was created with iTunes' export library command, written directly to a file.