I'm using lxml in Python 3 to parse a large XML file that contains HTML character entities (e.g. &lsqb; and &rsqb;), which are not defined in XML.
Here's an example of the problem, along with my attempt to use html.unescape() as suggested when this was marked as a duplicate. I'm still struggling to make this work. The following works, but it is pretty slow and seems really hacky:
from lxml import etree
import html
import re

s = b"""<?xml version="1.0" encoding="UTF-8"?><tag>&lsqb;0001&rsqb;</tag>"""


def unescape(s):
    # According to this: http://xml.silmaril.ie/specials.html
    # There are only 4 special characters for XML. Handle them separately.
    #
    # This site lists the other codes:
    # https://www.dvteclipse.com/documentation/svlinter/How_to_use_special_characters_in_XML.3F.html
    #
    # Map each XML escape to the spellings that must be protected and to a
    # temporary placeholder that isn't likely to be in the data.
    # Each placeholder must be unique, or the escapes get mixed up on restore.
    tmptxt = {b'&amp;': ((b'&#38;', b'&#x26;', b'&amp;'), b'zZh7001HdahHq'),
              b'&lt;': ((b'&#60;', b'&#x3C;', b'&lt;'), b'zZh7002HdahHq'),
              b'&gt;': ((b'&#62;', b'&#x3E;', b'&gt;'), b'zZh7003HdahHq'),
              b'&apos;': ((b'&#39;', b'&#x27;', b'&apos;'), b'zZh7004HdahHq')}
    # Replace the XML special chars with tmptxt
    for escaped, (searches, placeholder) in tmptxt.items():
        for search in searches:
            s = s.replace(search, placeholder)
    # Use html.unescape for everything else
    s = html.unescape(s.decode()).encode()
    # Get rid of any other codes and hope for the best. This has to happen
    # before the placeholders are restored, because the pattern would also
    # match &amp; and friends.
    regex = re.compile(rb'&[^\t\n\f <&#;]{1,32};')
    s = regex.sub(b'', s)
    # Replace tmptxt with the allowed XML escapes.
    for escaped, (searches, placeholder) in tmptxt.items():
        s = s.replace(placeholder, escaped)
    return s


tree = etree.fromstring(unescape(s))
print(etree.tostring(tree))
The second approach that seems to work is tree = etree.fromstring(s, parser=etree.XMLParser(recover=True)). This also seems pretty slow, but it is obviously much cleaner.
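For comparison, here is a minimal sketch of a third option I have been considering: a single regex pass that rewrites HTML named entities as numeric character references, which any XML parser accepts, so no placeholders are needed. The names html_entities_to_numeric, XML_PREDEFINED and NAMED_REF are my own and purely illustrative. It assumes the entities follow the standard HTML5 table in html.entities, and that dropping unknown names (as the code above does) is acceptable. It would also rewrite matches inside CDATA sections and comments, which may or may not matter for a given file.

```python
import re
from html.entities import html5

# Entities that XML itself defines; these are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# A named reference such as &lsqb; (numeric ones like &#91; do not match).
NAMED_REF = re.compile(rb"&([A-Za-z][A-Za-z0-9]*);")


def html_entities_to_numeric(data: bytes) -> bytes:
    """Rewrite HTML named entities as numeric character references."""

    def repl(match):
        name = match.group(1).decode("ascii")
        if name in XML_PREDEFINED:
            return match.group(0)
        # Keys in the html5 table include the trailing semicolon.
        text = html5.get(name + ";")
        if text is None:
            # Unknown name: drop it, mirroring the regex in unescape().
            return b""
        # Some entities expand to more than one code point.
        return "".join("&#%d;" % ord(ch) for ch in text).encode("ascii")

    return NAMED_REF.sub(repl, data)
```

The result could then be passed straight to etree.fromstring() in place of unescape(s). I have not measured whether it is actually faster on a large file.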