How to deal with malformed XML with HTML character codes in lxml

Question

I'm parsing a large XML file using lxml in Python 3 that contains HTML character entities (e.g. &lsqb; and &rsqb;), which are not defined in XML.

Here's an example of the problem, along with my attempt to use html.unescape() as suggested when this was marked as a duplicate. I'm still struggling to make this work well. The following works, but it is pretty slow and seems really hacky:

from lxml import etree
import html
import re

s = b"""<?xml version="1.0" encoding="UTF-8"?><tag>&lsqb;0001&rsqb;</tag>"""

def unescape(s):
    # See: http://xml.silmaril.ie/specials.html
    # XML itself predefines only five entities (&amp; &lt; &gt; &apos; &quot;).
    # Handle them separately so html.unescape() cannot turn them into markup.
    #
    # This site lists the other codes:
    # https://www.dvteclipse.com/documentation/svlinter/How_to_use_special_characters_in_XML.3F.html
    #
    # Use temporary text that isn't likely to be in the data.  Each entity
    # needs its own placeholder so it can be restored unambiguously.
    tmptxt = {b'&amp;':  ((b'&#x26;', b'&#38;', b'&amp;'),  b'zZh7001HdahHq'),
              b'&lt;':   ((b'&#x3c;', b'&#60;', b'&lt;'),   b'zZh7002HdahHq'),
              b'&gt;':   ((b'&#x3e;', b'&#62;', b'&gt;'),   b'zZh7003HdahHq'),
              b'&apos;': ((b'&#x27;', b'&#39;', b'&apos;'), b'zZh7004HdahHq'),
              b'&quot;': ((b'&#x22;', b'&#34;', b'&quot;'), b'zZh7005HdahHq')}

    # Replace the XML special chars with their placeholders.
    for searches, placeholder in tmptxt.values():
        for search in searches:
            s = s.replace(search, placeholder)

    # Translate every entity that html.unescape() knows about.
    s = html.unescape(s.decode()).encode()

    # Get rid of any other codes and hope for the best.  This has to happen
    # before the placeholders are restored, or it would strip &amp; etc. too.
    regex = re.compile(rb'&[^\t\n\f <&#;]{1,32};')
    s = regex.sub(b'', s)

    # Replace the placeholders with the allowed XML special chars.
    for entity, (_, placeholder) in tmptxt.items():
        s = s.replace(placeholder, entity)

    return s


tree = etree.fromstring(unescape(s))

print(etree.tostring(tree))

The second approach that seems to work is `tree = etree.fromstring(s, parser=etree.XMLParser(recover=True))`. This also seems pretty slow, but it is obviously much cleaner.
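
For comparison, one possible way to avoid the placeholder shuffle is a single regex pass that rewrites each HTML named entity as a numeric character reference, which an XML parser accepts without any DTD. This is only a sketch under assumptions: `html_entities_to_numeric`, `XML_PREDEFINED`, and `NAMED_ENTITY` are hypothetical names, the lookup table is the standard library's `html.entities.html5`, and unknown entities are simply dropped, mirroring the "hope for the best" step above. The import falls back to `xml.etree.ElementTree` only so that the example still runs where lxml is not installed.

```python
import re
from html.entities import html5  # maps names such as 'lsqb;' to characters

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# The five entities XML predefines; these must stay escaped.
XML_PREDEFINED = {b'amp', b'lt', b'gt', b'apos', b'quot'}

# A named entity reference such as &lsqb;.  Numeric references like &#91;
# never match, because '#' is not allowed in the name.
NAMED_ENTITY = re.compile(rb'&([A-Za-z][A-Za-z0-9]*);')


def html_entities_to_numeric(data):
    """Rewrite HTML named entities in XML bytes as numeric references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)  # leave &amp;, &lt; etc. untouched
        chars = html5.get(name.decode('ascii') + ';')
        if chars is None:
            return b''  # unknown entity: drop it
        # Some HTML entities expand to more than one code point.
        return b''.join(b'&#%d;' % ord(c) for c in chars)
    return NAMED_ENTITY.sub(replace, data)


s = b"""<?xml version="1.0" encoding="UTF-8"?><tag>&lsqb;0001&rsqb; &amp; more</tag>"""
fixed = html_entities_to_numeric(s)
tree = etree.fromstring(fixed)
print(tree.text)  # [0001] & more
```

Because every replacement is itself a numeric reference, an entity that stands for `<` or `&` cannot end up as raw markup in the output, so no placeholder step is needed.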

Just in case the duplicate is not clear enough, what you need is `s = html.unescape(s) ... etree.parse(sfid)` — DeepSpace, Oct 24 '18 at 14:36
In other words, I have to pre-process the file first? No way around that? Any way to html.unescape within the `lxml` parser? — EpicAdv, Oct 24 '18 at 14:54
I don't know. You can check by reading `lxml` documentation, which is what I would do if I were to attempt to answer that question. — DeepSpace, Oct 24 '18 at 15:16
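
On the question of doing this inside the parser rather than by unescaping first: one option that may work is to declare the entities in an internal DTD subset, so that the parser expands them itself. A minimal sketch, assuming the document has no DOCTYPE of its own and the set of entities is known in advance; `add_entity_doctype`, `ENTITIES`, and the `root_name` parameter are hypothetical, and whether lxml expands the references depends on the parser's `resolve_entities` setting, which is left at its default here. The fallback import is only there so the example runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical table: entity name -> Unicode code point.  Characters that are
# markup in XML ('<' and '&') would need extra escaping and are left out.
ENTITIES = {'lsqb': 0x5B, 'rsqb': 0x5D}


def add_entity_doctype(data, root_name, entities=ENTITIES):
    """Insert a DOCTYPE that declares the given entities into XML bytes."""
    decls = ''.join('<!ENTITY %s "&#%d;">' % (name, cp)
                    for name, cp in entities.items())
    doctype = ('<!DOCTYPE %s [%s]>' % (root_name, decls)).encode('ascii')
    if data.startswith(b'<?xml'):
        # The DOCTYPE has to come after the XML declaration.
        end = data.index(b'?>') + 2
        return data[:end] + doctype + data[end:]
    return doctype + data


s = b"""<?xml version="1.0" encoding="UTF-8"?><tag>&lsqb;0001&rsqb;</tag>"""
tree = etree.fromstring(add_entity_doctype(s, 'tag'))
print(tree.text)
```

This still touches the bytes once to add the DOCTYPE, but the entity handling itself is left to the parser.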
