fixing invalid XML with Python or Perl: UTF-16 surrogates for emoji encoded in UTF-8

Question

I'm trying to do a numerical analysis of my text messages using all the old backups I could gather. Ideally, emoji will be included in the analysis. I'm using a mix of Python and Perl to get things all into one place, and will likely use R once that's done.

However, I've run into trouble with emoji encoding. Some of my backups were created using the SMS Backup and Restore app on Android to pull my texts as an XML file. I started throwing my XML into this Python SMS module available on GitHub by t413. When the module threw an error in the parser, I put my messages through a validator to see what was up, and the XML wasn't valid due to invalid characters. For example, here is part of a text I received that doesn't play well with Perl's XML::Validate module:

So if we get it out hang out will be short &#55357;&#56852;...

I don't know all the particulars of Unicode, but from what I can tell my text messages contain HTML-style numeric escapes for the UTF-16 high/low surrogates. Individually they're invalid characters, but together they do encode 😔 (U+1F614). (The XML header does specify UTF-8.)
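For reference, the pairing arithmetic can be sketched in a few lines of Python 3. This is an illustration and not part of the original post; the function name combine_surrogates is made up for the example.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    # High surrogates occupy 0xD800-0xDBFF, low surrogates 0xDC00-0xDFFF.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(55357, 56852)
print(code_point)       # 128532
print(hex(code_point))  # 0x1f614
```

So the two references &#55357;&#56852; stand for the single code point U+1F614, which valid XML would write as &#128532; (or as the literal character in UTF-8).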

A lot of these texts are already deleted from my phone (some of these backups are almost a year old) so I can't simply pull them again and see if I can fix the formatting like that.

My question: before I start digging into the particulars of Unicode and HTML escape characters and taking the time to write something to fix this myself (I know from this question that there is a formula to use to convert the surrogates, and that there are encode/decode methods for strings in Python, and various bits and pieces to help out with HTML entities), is there any existing module/built-in function in Python or Perl that can help me fix my file's encoding, or at least get me part of the way there? (Or even a Unix/Linux command line tool that I'm missing.)

Alastair McCormack · Accepted Answer · 2016-07-02T07:19:15.167

Use Python's Beautiful Soup module. This will unescape XML entities, including UTF-16 surrogates.

Assuming the XML looks like the sample below, you can do the following to retrieve the body of the message as a Unicode string:

from bs4 import BeautifulSoup

my_xml = """<sms protocol="0" address="09001234567" date="1365481757533" type="2" subject="null"
body="So if we get it out hang out will be short &#55357;&#56852;" toa="null" sc_toa="null" service_center="null"
read="1" status="32" locked="0" date_sent="0" readable_date="2013/04/09 12:29:17"
contact_name="Cute Chic" />"""

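# html.parser is lenient, so the surrogate references do not abort parsing
soup = BeautifulSoup(my_xml, 'html.parser')

# Attribute values come back with their character references already unescaped
message = soup.sms['body']

# Python 2 print statements, matching the <type 'unicode'> result shown below
print message
print type(message)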

Result:

So if we get it out hang out will be short 😔
<type 'unicode'>
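If you would rather repair the file itself so that a strict XML parser accepts it, a standard-library approach is also possible. The following is a minimal sketch for Python 3, not the method used above: it assumes the surrogates always appear as adjacent decimal references (as in the sample), and the names fix_surrogate_refs and _merge_run are hypothetical helpers invented for this example. Lone surrogates are left untouched, so they would still be invalid.

```python
import re
import xml.etree.ElementTree as ET

REF_RUN = re.compile(r"(?:&#\d+;)+")  # one or more adjacent decimal references
REF = re.compile(r"&#(\d+);")


def _merge_run(match):
    """Rewrite one run of references, merging each high/low surrogate pair."""
    codes = [int(n) for n in REF.findall(match.group(0))]
    merged = []
    i = 0
    while i < len(codes):
        high = codes[i]
        low = codes[i + 1] if i + 1 < len(codes) else None
        if (0xD800 <= high <= 0xDBFF and low is not None
                and 0xDC00 <= low <= 0xDFFF):
            # Standard UTF-16 decoding formula for a surrogate pair.
            merged.append(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            merged.append(high)  # ordinary reference or lone surrogate: keep
            i += 1
    return "".join("&#%d;" % code for code in merged)


def fix_surrogate_refs(text):
    """Replace '&#high;&#low;' pairs with one reference to the real code point."""
    return REF_RUN.sub(_merge_run, text)


raw = '<sms body="So if we get it out hang out will be short &#55357;&#56852;" />'
fixed = fix_surrogate_refs(raw)
print(fixed)
# <sms body="So if we get it out hang out will be short &#128532;" />

# The repaired text now parses with a strict XML parser.
body = ET.fromstring(fixed).get("body")
print(hex(ord(body[-1])))  # 0x1f614
```

For a whole backup you would read the file as text, pass its contents through fix_surrogate_refs, and write the result to a new file, keeping the original as a backup.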

Got it, thanks! Fixed my weird unicode characters and I was able to use the prettify method and Python's built in encode methods to get it into a format that played nicely with the SMS module I'm using. — katgor, Jul 02 '16 at 20:28
