I want to read an XML file into Python, but it contains a lot of emojis, and Python seems to have a problem with them. I have spent the last three days searching Google for this problem, but I could not find an answer.

This is a snippet of my XML file:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?> 
<!-- File Created By Signal -->
<smses count="1">
<sms protocol="0" address="+49 0000 00000" date="1456340389816" type="2" subject="null" body="Party! &#55356;&#57225;" toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" />
</smses>

And this is what my code looks like:

import xml.dom.minidom as dom

file = '/Users/...'
xmldoc = dom.parse(file)
itemlist = xmldoc.getElementsByTagName('sms')
print(len(itemlist))
for s in itemlist:
    print(s.attributes['body'].value)

It works for XML files without emojis, but for the example above it already fails in line 4, when reading the file. So I opened the XML file in Visual Studio, and it tells me that &#55356; and &#57225; (which together should represent 🎉) are invalid characters. When I replace both of them with &#127881;, which is the decimal HTML entity for 🎉, the XML file looks OK, but Python still can't read it. Does anyone have an idea how to get this script to run?
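
For reference, the relationship between the two numbers in the file and &#127881; can be checked with a few lines of arithmetic. This is only a sketch, assuming Python 3; the variable names are made up for illustration. The two values fall in the UTF-16 surrogate ranges, and combining them with the standard surrogate-pair formula gives the single code point of the emoji:

```python
# The two numeric character references from the XML snippet.
high, low = 55356, 57225

# A UTF-16 surrogate pair consists of a high surrogate (0xD800..0xDBFF)
# followed by a low surrogate (0xDC00..0xDFFF).
assert 0xD800 <= high <= 0xDBFF
assert 0xDC00 <= low <= 0xDFFF

# Standard formula for turning a surrogate pair into one code point.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# code_point is 127881 (0x1F389), the value used in the hand-edited file.
emoji = chr(code_point)
```

A surrogate value on its own is not a legal character in XML, which would explain why an editor flags the two references as invalid.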

Jan

1 Answer

You need to change &#55356; and &#57225; to a format that Python understands. Those are Unicode character references; how XML handles Unicode is explained here: https://www.w3.org/TR/unicode-xml/. For Python, those characters would be \u5536 and \u57225. See also this post about Unicode and Python: How to print Unicode character in Python?

ɯɐɹʞ
  • Thank you, but this is not the problem. When you convert the expression above into a string like `'Party! \u5536\u57225'`, Python prints `'Party! 唶圢5'`, but these two Unicode characters should represent one emoji: 🎉. I found out that the numbers are the decimal UTF-16 representation of 🎉, but my file is UTF-8 encoded. How can I convert the emoji into UTF-8? – Jan Jul 21 '17 at 18:06
  • @Jan, this should help: https://stackoverflow.com/questions/31207287/converting-utf-16-to-utf-8 – ɯɐɹʞ Jul 21 '17 at 18:12
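
One possible way to approach the conversion asked about in the comments is to rewrite the text before parsing, so that each pair of surrogate references becomes a single reference to the real code point. The following is a minimal sketch rather than a definitive fix. It assumes Python 3, and the helper names `merge_surrogate_refs` and `_merge_run` are hypothetical, introduced only for this example. It handles decimal references only (as in the snippet from the question), and unpaired surrogate references are left untouched, so they would still be rejected by the parser.

```python
import re
import xml.dom.minidom as dom

# A run of one or more consecutive decimal character references.
REF_RUN = re.compile(r"(?:&#\d+;)+")
# A single decimal character reference.
REF = re.compile(r"&#(\d+);")


def _merge_run(match):
    """Rewrite one run of references, merging surrogate pairs."""
    values = [int(v) for v in REF.findall(match.group(0))]
    out = []
    i = 0
    while i < len(values):
        value = values[i]
        is_high = 0xD800 <= value <= 0xDBFF
        has_low = i + 1 < len(values) and 0xDC00 <= values[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine high and low surrogate into one code point.
            value = 0x10000 + ((value - 0xD800) << 10) + (values[i + 1] - 0xDC00)
            i += 1  # skip the low surrogate, it has been consumed
        out.append("&#{};".format(value))
        i += 1
    return "".join(out)


def merge_surrogate_refs(text):
    """Return text with surrogate-pair references replaced by one reference."""
    return REF_RUN.sub(_merge_run, text)


# Shortened version of the XML from the question.
sample = '<smses count="1"><sms body="Party! &#55356;&#57225;" /></smses>'
fixed = merge_surrogate_refs(sample)

xmldoc = dom.parseString(fixed)
body = xmldoc.getElementsByTagName("sms")[0].attributes["body"].value
```

For a real file, the same idea would be to read the file's text, pass it through `merge_surrogate_refs`, and hand the result to `dom.parseString` instead of calling `dom.parse` on the path directly.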