
I'm not sure whether my problem is with my programming or Apple's iTunes library export, but I'm going to start by assuming it's my programming.

I'm trying to parse the XML library export from iTunes. The fragment that does the parsing is simply:

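    import xml.etree.ElementTree as ET  # assumed alias; the import was not shown

    def parse_tree(source):
        # iterparse yields (event, element) pairs; only "end" events by default
        parser = ET.iterparse(source)
        for action, elem in parser:
            # ...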

This is failing on the file iTunes gives me when I export my library, with error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 886: character maps to <undefined>

The XML header is <?xml version="1.0" encoding="UTF-8"?> and the offending fragment of XML appears to be:

<key>Name</key><string>Part 2. The Death Of Enkidu. Skon Přitele Mého Mne Zdeptal Težče</string>

This is rendered just fine by iTunes, the oXygen XML editor and, I see, by Stack Overflow. But the HxD hex editor does show a 0x8D in there, and flags it as undefined in UTF-8. The relevant bit of hex seems to be:

6C 20 54 65 C5 BE C4 8D 65 3C 2F 73 74 72 69 6E
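
As a quick check, the hex above can be decoded directly in Python. This is only a sketch: `cp1252` is an assumption about which Windows code page the 'charmap' codec in the error message stands for on this machine.

```python
# Bytes copied from the hex dump above.
raw = bytes.fromhex("6C 20 54 65 C5 BE C4 8D 65 3C 2F 73 74 72 69 6E")

# Decoding as UTF-8 succeeds: C5 BE is one character and C4 8D is another.
text = raw.decode("utf-8")
print(ascii(text))  # escaped form, safe to print on any console

# Decoding with a Windows code page (cp1252 assumed here) fails on 0x8D,
# because that byte has no mapping in that code page.
error = None
try:
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

If the second decode fails on byte 0x8d, that would suggest the file itself is fine and something is reading it with the wrong codec.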

So is this iTunes not exporting valid UTF-8, Python's ElementTree not handling it correctly, or me doing something wrong? And how do I get it to read the rest of the element, skipping the character it can't read or replacing it with a default character such as a question mark?

Edit: The OS is Windows 10. The file was created using iTunes export library command, directly to file.

digitig
  • When you say *export*, does that entail writing the string to the Windows console? And if so, what version of Python are you using? https://stackoverflow.com/questions/14630288/unicodeencodeerror-charmap-codec-cant-encode-character-maps-to-undefined – BoarGules May 11 '18 at 11:57
  • Those bytes _are_ valid UTF-8. When I convert that hex you posted to bytes I get `b'l Te\xc5\xbe\xc4\x8de – PM 2Ring May 11 '18 at 12:40
  • For some reason your program is trying to decode those bytes as if they were encoded with the 'charmap' codec, not the UTF-8 codec. I suppose you're using Windows, you don't mention your OS in your question, and you haven't shown us any code that relates directly to Unicode issues. I suspect that you haven't set up your OS to use UTF-8 as the default (I've read that that's a bit tricky to do in Windows). There's probably a simple way to fix this, eg explicitly specify UTF-8 as the encoding when you open the XML file, as illustrated in [this answer](https://stackoverflow.com/a/9233174/4014959). – PM 2Ring May 11 '18 at 12:45
  • @PM2Ring I'll try that, thanks. Though does that indicate a problem with `ElementTree`? I'm not opening the file myself, I'm passing the path directly to `iterparse`, and when I was looking for possible solutions to this problem the universal advice was against specifying the encoding of XML files in one's code, because it could potentially conflict with what's in the XML prolog (and the parser should default to UTF-8 anyway, as indicated by the W3C specification - sorry, recommendation - for XML). – digitig May 11 '18 at 15:07
  • Hmmm. I would expect it to Just Work properly on Python 3. Sorry, I haven't used Python's XML stuff much (and that was a while ago), so I have no specific advice. That recommendation makes sense, but if you can pass an open file to an XML parser that should solve this problem. – PM 2Ring May 11 '18 at 15:13
  • FWIW, last time I did XML stuff I used [xml.dom.minidom](https://docs.python.org/3/library/xml.dom.minidom.html). That parser allows you to pass it an open file. – PM 2Ring May 11 '18 at 15:20
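
The suggestion in the comments above (hand the parser an open file rather than decoding the text yourself) might look like the sketch below. It assumes the file is opened in binary mode so that the parser reads the encoding from the XML prolog. `parse_names` and the sample document are hypothetical, made up for illustration; they are not the real iTunes export.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def parse_names(path):
    """Collect the text of every <string> element (hypothetical helper)."""
    names = []
    # Binary mode: no codec is applied by open(), so the XML parser
    # decodes the bytes itself using the encoding named in the prolog.
    with open(path, "rb") as source:
        for _event, elem in ET.iterparse(source):
            if elem.tag == "string":
                names.append(elem.text)
    return names


# Tiny stand-in for the export file, written as UTF-8 bytes.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<dict><key>Name</key><string>Zdeptal Te\u017e\u010de</string></dict>\n"
)
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))

names = parse_names(tmp.name)
os.remove(tmp.name)
print([ascii(name) for name in names])
```

Opening in binary mode avoids hard-coding an encoding in the program, so it should not conflict with whatever the prolog declares.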
