Escape HTML entities that aren't XML

Question

I'm parsing an XML file that was produced by an SMS backup app, but some things are escaped with HTML entities. I'm using xml.etree.ElementTree, but it complains with xml.etree.ElementTree.ParseError: reference to invalid character number: line 29, column 308, which lines up with &#55357;&#56841; in the XML file. I know that I can use BeautifulSoup. In fact, I already have a working program that uses it, but I'm trying to rewrite it so that I can speed it up. A sample tag is here:

<sms protocol="0" address="1012223434" date="1548857971596" type="1" subject="null" body="... by the time you want a ride. &#55357;&#56841;&#10;" toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" date_sent="0" readable_date="Jan 30, 2019 9:19:31 AM" contact_name="Mom" />

I've used iterparse on the data to avoid consuming too much memory. I've also tried plain parse, clearing every element when I'm done with it so that I have better control, but I haven't been able to figure out this one part. If I use html.unescape, it unescapes too much, and then I get xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 27, column 175, which is where there was a ' before it was unescaped. If I try to put xml.sax.saxutils.escape on top of the unescaped HTML, then that of course also escapes everything else that's actually supposed to be part of the XML.

How can I unescape the HTML entities without going too far and unescaping all the XML entities?

score 0 · Answer 1 · answered Jan 15 '20 at 21:04

XML Allowed Characters

Per the W3C XML Recommendation:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
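The Char production can be transcribed almost literally into a range check. The following is a minimal sketch; the function name is_xml_char is made up for illustration and is not part of any library.

```python
def is_xml_char(code_point: int) -> bool:
    """Return True if code_point matches the XML 1.0 Char production [2].

    is_xml_char is a hypothetical helper name, not a standard-library function.
    """
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


# The numeric references from the question, plus one supplementary-plane code point.
for cp in (10, 55357, 56841, 0x1F609):
    print(f"&#{cp}; (U+{cp:04X}) legal in XML: {is_xml_char(cp)}")
```

This reports 10 and 0x1F609 as legal, and 55357 and 56841 as illegal, because the last two fall in the gap between #xD7FF and #xE000.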

Notation

&#d; notation means that d is a decimal representation of the character's code point.
&#xd; notation means that d is a hexadecimal representation of the character's code point.

Error Analysis

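&#55357; is &#xD83D;, a UTF-16 high surrogate. It falls in the gap between #xD7FF and #xE000, so it is not a legal character in XML.
&#56841; is &#xDE09;, a UTF-16 low surrogate in the same gap, so it is also not a legal character in XML.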

Therefore, your opening statement,

"I'm parsing an XML file"

is incorrect and you cannot use a conformant XML parser to parse this data. Instead, you're relegated to using techniques at How to parse invalid (bad / not well-formed) XML?

The #1 recommendation there is to fix the problem at the origin. (Hint: In UTF-16, 55,357 56,841 is the surrogate pair for U+1F609, the winking face emoji 😉, so consider encoding issues.) If fixing the origin is not possible, the above link suggests numerous other alternatives for dealing with bad "XML" in many different programming languages, including Python.
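If the backup app cannot be fixed, one possible workaround that follows from the UTF-16 hint is to repair the text before handing it to the parser: find adjacent decimal character references that form a high/low surrogate pair and replace them with a single reference to the code point they encode. The sketch below assumes the bad references are always decimal and always adjacent, as in the sample tag; merge_surrogate_refs and its helpers are hypothetical names, and the sample is shortened to two attributes.

```python
import re
import xml.etree.ElementTree as ET

# A run of adjacent decimal character references, e.g. "&#55357;&#56841;&#10;".
_REF_RUN = re.compile(r"(?:&#\d+;)+")
_REF = re.compile(r"&#(\d+);")


def _merge_run(match: re.Match) -> str:
    """Rewrite one run of references, merging surrogate pairs into one code point."""
    codes = [int(digits) for digits in _REF.findall(match.group(0))]
    merged = []
    i = 0
    while i < len(codes):
        hi = codes[i]
        lo = codes[i + 1] if i + 1 < len(codes) else None
        if 0xD800 <= hi <= 0xDBFF and lo is not None and 0xDC00 <= lo <= 0xDFFF:
            # Standard UTF-16 decoding of a high/low surrogate pair.
            merged.append(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
            i += 2
        else:
            # Not part of a pair: keep the reference unchanged.
            merged.append(hi)
            i += 1
    return "".join(f"&#{code};" for code in merged)


def merge_surrogate_refs(xml_text: str) -> str:
    """Hypothetical helper: replace surrogate-pair references with one reference."""
    return _REF_RUN.sub(_merge_run, xml_text)


# Shortened version of the tag from the question.
sample = '<sms body="... by the time you want a ride. &#55357;&#56841;&#10;" contact_name="Mom" />'
fixed = merge_surrogate_refs(sample)
print(fixed)

elem = ET.fromstring(fixed)
print(repr(elem.get("body")))
```

Only the numeric references are touched, so real XML entities such as &amp; and &quot; stay escaped and html.unescape is not needed. A lone surrogate with no partner is left as it is and will still make the parser fail. For a large file, the same function could be applied to each line before feeding it to the parser, provided a pair is never split across two lines.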

from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''<sms protocol="0" address="1012223434" date="1548857971596" type="1" subject="null" body="... by the time you want a ride. &#55357;&#56841;&#10;" toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" date_sent="0" readable_date="Jan 30, 2019 9:19:31 AM" contact_name="Mom" />'''
doc = SimplifiedDoc(html).getElementByTag('sms')
print(doc)

Result:

{'tag': 'sms', 'protocol': '0', 'address': '1012223434', 'date': '1548857971596', 'type': '1', 'subject': 'null', 'body': '... by the time you want a ride. &#55357;&#56841;&#10;', 'toa': 'null', 'sc_toa': 'null', 'service_center': 'null', 'read': '1', 'status': '-1', 'locked': '0', 'date_sent': '0', 'readable_date': 'Jan 30, 2019 9:19:31 AM', 'contact_name': 'Mom'}

You can get the examples of SimplifiedDoc here
