I want to read an XML file into Python, but it contains a lot of emojis, and Python seems to have a problem with that. I have spent the last three days searching Google for this problem, but I could not find an answer.
This is a snippet of what my XML file looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!-- File Created By Signal -->
<smses count="1">
<sms protocol="0" address="+49 0000 00000" date="1456340389816" type="2" subject="null" body="Party! &#55356;&#57225;" toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" />
</smses>
And this is what my code looks like:
import xml.dom.minidom as dom
file = '/Users/...'
xmldoc = dom.parse(file)
itemlist = xmldoc.getElementsByTagName('sms')
print(len(itemlist))
for s in itemlist:
    print(s.attributes['body'].value)
It works for XML files without emojis, but for the example above it already fails on line 4, while reading the file. So I opened the XML file in Visual Studio, and it tells me that &#55356; and &#57225; (which should represent 🎉) are invalid characters. When I replace both of these references with &#127881;, which is the HTML entity (decimal) for 🎉, the XML file looks OK, but Python still can't read it. Does anyone have an idea how to get this script to run?
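To make the manual replacement concrete: 55356 and 57225 are the two halves of a UTF-16 surrogate pair, and combining them arithmetically gives 127881, the code point of 🎉. The following is only a minimal sketch of how that replacement might be automated before parsing. The helper name `merge_surrogate_refs` and the inline `sample` string are made up for illustration, and it assumes that the references are always decimal, have no leading zeros, and that each high surrogate is directly followed by its low surrogate.

```python
import re
import xml.dom.minidom as dom

# Decimal ranges of UTF-16 surrogates, written out as digit patterns:
# high surrogates are 55296..56319, low surrogates are 56320..57343.
HIGH = r"(5529[6-9]|55[3-9]\d\d|56[0-2]\d\d|563[01]\d)"
LOW = r"(563[2-9]\d|56[4-9]\d\d|57[0-2]\d\d|573[0-3]\d|5734[0-3])"
SURROGATE_PAIR = re.compile("&#" + HIGH + ";&#" + LOW + ";")


def merge_surrogate_refs(xml_text):
    """Hypothetical helper: turn each surrogate pair into one reference."""
    def merge(match):
        high, low = int(match.group(1)), int(match.group(2))
        # Standard UTF-16 decoding of a surrogate pair into a code point.
        code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return "&#{};".format(code_point)
    return SURROGATE_PAIR.sub(merge, xml_text)


# Inline stand-in for the file contents (assumption: the real script
# would read the text from the XML file instead).
sample = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>\n'
    '<smses count="1">\n'
    '<sms address="+49 0000 00000" body="Party! &#55356;&#57225;" />\n'
    '</smses>'
)

fixed = merge_surrogate_refs(sample)

# Parse the repaired text; bytes are passed so the UTF-8 declaration applies.
xmldoc = dom.parseString(fixed.encode("utf-8"))
bodies = [s.attributes['body'].value
          for s in xmldoc.getElementsByTagName('sms')]
print(len(bodies))
```

Other character references are left untouched, so only the invalid pairs change. Whether printing the resulting `body` values works also depends on the terminal's encoding, which is why the sketch only prints the number of messages.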