I'm parsing an XML file that was produced by an SMS backup app, but some things are escaped with HTML entities. I'm using xml.etree.ElementTree, but it complains with xml.etree.ElementTree.ParseError: reference to invalid character number: line 29, column 308, which lines up with �� in the XML file. I know that I can use BeautifulSoup. In fact, I already have a working program that uses it, but I'm trying to rewrite it so that I can speed it up. A sample tag is here:
<sms protocol="0" address="1012223434" date="1548857971596" type="1" subject="null" body="... by the time you want a ride. �� " toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" date_sent="0" readable_date="Jan 30, 2019 9:19:31 AM" contact_name="Mom" />
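A minimal reproduction of the first error may help here. It is only a sketch: it assumes the two garbled characters in the sample were an emoji written as a UTF-16 surrogate pair in two numeric character references. The values `&#55357;&#56832;` and the shortened tag are illustrative and not taken from the backup file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the garbled characters: one emoji written as
# two numeric character references, one per UTF-16 surrogate.
sample = '<sms address="1012223434" body="want a ride. &#55357;&#56832; " />'

try:
    ET.fromstring(sample)
    error = None
except ET.ParseError as exc:
    # Surrogate code points are not legal XML characters, so the parser
    # rejects the character reference itself.
    error = exc

print(error)
```

If that assumption holds, the parser is objecting to the reference number rather than to an unknown entity name.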
I've used iterparse on the data in the interest of not consuming too much memory, but I've also tried just using parse and clearing every element when I'm done so that I can have better control, but I haven't actually been able to figure out this one part. If I use html.unescape, it unescapes too much, and then I get xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 27, column 175, which is where there was a ' before it was unescaped. If I try to put xml.sax.saxutils.escape on top of the unescaped HTML, then that of course also escapes everything else that's actually supposed to be part of the XML.
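The over-unescaping problem can be shown in a few lines. This is a sketch with a made-up tag: the `&quot;` escapes and the surrogate-pair references are assumptions chosen for illustration, not the exact content of line 27 in the real file.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical tag mixing a legitimate XML escape (&quot;) with an
# assumed surrogate-pair reference.
raw = '<sms body="she said &quot;hi&quot; &#55357;&#56832;" />'

# html.unescape() resolves every entity it recognises, including the
# ones the XML syntax depends on.
unescaped = html.unescape(raw)

try:
    ET.fromstring(unescaped)
    error = None
except ET.ParseError as exc:
    # The former &quot; escapes are now bare quotes that end the
    # attribute value early, so the markup is no longer well-formed.
    error = exc

print(unescaped)
print(error)
```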
How can I unescape the HTML entities without going too far and unescaping all the XML entities?
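One possible direction, sketched under the same assumption that the offending references are decimal UTF-16 surrogate pairs without leading zeros: rewrite only those pairs into a single code-point reference and leave every other entity for the XML parser. The name `merge_surrogate_refs` and the sample tag are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# A high surrogate (55296-56319) immediately followed by a low surrogate
# (56320-57343), both written as decimal character references.
SURROGATE_PAIR = re.compile(
    r"&#(5529[6-9]|55[3-9]\d\d|56[0-2]\d\d|563[01]\d);"
    r"&#(563[2-9]\d|56[4-9]\d\d|57[0-2]\d\d|573[0-3]\d|5734[0-3]);"
)

def merge_surrogate_refs(text):
    """Replace each surrogate-pair reference with one code-point reference."""
    def repl(match):
        high, low = int(match.group(1)), int(match.group(2))
        # Standard UTF-16 decoding of a surrogate pair.
        code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return f"&#{code_point};"
    return SURROGATE_PAIR.sub(repl, text)

raw = '<sms body="she said &quot;hi&quot; &#55357;&#56832;" />'
fixed = merge_surrogate_refs(raw)
body = ET.fromstring(fixed).get("body")
print(fixed)
```

Because `&quot;`, `&amp;` and the other XML escapes are never touched, the parser still decodes them itself. With iterparse, the substitution would have to be applied to the raw text before it reaches the parser.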