
Is there a comprehensive way to find HTML entities (including foreign-language characters) and convert them to hexadecimal character references, or another encoding accepted by ElementTree? Is there a best practice for this?

I'm parsing a large data set of XML that uses HTML entities to encode Unicode and special characters. My script reads each XML file in line by line. When I parse the data with Python's ElementTree, I get the following error.

ParseError: undefined entity: line 296, column 29
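
For illustration, here is a minimal reproduction (the element name and entity are made up, not from the actual dataset). Note that the hexadecimal character reference form parses fine, while the named entity does not:

    >>> import xml.etree.ElementTree as ET
    >>> ET.fromstring('<title>Acme&#x2122; Widget</title>')
    <Element 'title' at 0x...>
    >>> ET.fromstring('<title>Acme&trade; Widget</title>')
    Traceback (most recent call last):
      ...
    xml.etree.ElementTree.ParseError: undefined entity &trade;: line 1, column 11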

I started by building a dictionary to find each named entity and replace it with its hexadecimal character reference. This has alleviated many of the errors; for example, it converts the trademark symbol &trade; to &#x2122;. However, there is no end in sight, because I keep finding HTML-escaped characters such as '&Aring;' and '&ouml;' used for foreign-language text. I have looked at several options and will describe them below.

xmlcharrefreplace: This did not resolve the foreign-language HTML-escaped values. The error handler only converts non-ASCII characters that are already decoded; it leaves named entities such as '&Aring;' untouched, as the example below shows.

line = line.encode('ascii', 'xmlcharrefreplace')
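
A quick interactive check (Python 2, sample string made up) makes the limitation visible; the raw é becomes a numeric reference, but the named entity passes through unchanged:

    >>> u'&Aring; caf\xe9'.encode('ascii', 'xmlcharrefreplace')
    '&Aring; caf&#233;'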

HTMLParser.unescape(): Did not work, I believe because XML needs some characters kept escaped, such as '<', '&' and '>', and unescape converts their entities to literal characters as well (see the example after the snippet).

import HTMLParser

h = HTMLParser.HTMLParser()
line = h.unescape(line)
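
For illustration, an interactive session (Python 2, sample string made up) shows the problem: unescape resolves the XML-significant entities along with everything else, so the result is no longer well-formed XML:

    >>> import HTMLParser
    >>> HTMLParser.HTMLParser().unescape(u'1 &lt; 2 &amp; &Aring;')
    u'1 < 2 & \xc5'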

Encoding to UTF-8: Did not work, I believe for a similar reason; encoding only changes the byte representation and leaves the entity references untouched, and XML still needs some characters escaped.

line = line.encode('utf-8')
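
A one-liner shows why (Python 2, sample string made up): the entity reference is plain ASCII, so encoding passes it through byte-for-byte:

    >>> u'Acme&trade;'.encode('utf-8')
    'Acme&trade;'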

BeautifulSoup: This returned a BeautifulSoup object, and converting it back to a string prepended an XML declaration to each line; even after stripping that, some other characters were added.

from bs4 import BeautifulSoup

line = BeautifulSoup(line, "xml")
line = str(line).replace('<?xml version="1.0" encoding="utf-8"?>', "").replace("\n", "")

htmlentitydefs: Still misses many characters; for example, it missed '&quest;' and '&equals;'. However, this got me further than the other options.

import re
from htmlentitydefs import name2codepoint

line = re.sub('&(%s);' % '|'.join(name2codepoint),
            lambda m: unichr(name2codepoint[m.group(1)]), line)
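
This is expected: name2codepoint covers only the HTML 4.01 entity set, and names such as '&quest;' and '&equals;' were defined later (they appear in the HTML5 entity list). A quick check:

    >>> from htmlentitydefs import name2codepoint
    >>> 'trade' in name2codepoint
    True
    >>> 'quest' in name2codepoint
    False
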
  • similar issues https://github.com/jbmorley/evernote-bookmarks/issues/3 and https://stackoverflow.com/questions/15209965/undefined-entity-error-while-using-elementtree and https://stackoverflow.com/questions/7693515/why-is-elementtree-raising-a-parseerror – Joe Feb 21 '18 at 07:09
  • https://chat.stackoverflow.com/rooms/24253/discussion-between-m-brindley-and-theta – Joe Feb 21 '18 at 07:12
  • Can we see a representative sample of your not-quite-XML dataset? – mzjn Feb 21 '18 at 12:18
  • It's XML. And it's well formed. The problem does not lie there, and this question can be pondered without an example of the XML; there is nothing you would glean from it. I'm parsing millions of records and most parse just fine. It's the ones with extremely old html-entities, and how to get rid of them. There are entities in there, such as '&lE;', which cannot be found by searching Google. – raw-bin hood Feb 21 '18 at 14:20
  • If you have references to undefined entities then your dataset is not well-formed, which means that it's not XML. – mzjn Feb 21 '18 at 16:47
  • Here is a link to all the XML (https://bulkdata.uspto.gov/). I'm parsing the front-page grants and applications first. It's XML, but it has been confounded with old html entities in the older files (2004 and older). Please feel free to communicate your opinions with the USPTO. – raw-bin hood Feb 22 '18 at 07:12

1 Answer


Here is what I have done to solve this problem. I used a multi-pronged approach in lieu of a single module or solution. First, I wrote a scraper and used it to build a large dictionary (replacement_dict), larger than the sample posted here; a site like https://www.freeformatter.com/html-entities.html#iso88591-characters is a good source to scrape. In the sanitize function, I first replace every entity in that dict. From there, htmlentitydefs and xmlcharrefreplace pick up the remaining entities that the standard library knows about. Finally, a basic regex removes the entities I could not find anywhere, neither in any "comprehensive list" online nor on sites like http://www.graphemica.com. That was the real problem: the data contains erroneous entities that even a Google search cannot turn up. Anyway, problem solved; all the html entities, even the erroneous ones, are dealt with. The code is posted below. Maybe overkill, but it got every last one of them!

    # A sample of the scraped dictionary: standard entities map to hex
    # character references; entities I could not identify anywhere are
    # dropped (mapped to "") or roughly approximated.
    replacement_dict = {
        '&sect;' : '&#x00A7;',
        '&otilde;' : '&#x00F5;',
        '&iacute;' : '&#x00ED;',
        '&cent;' : '&#x00A2;',
        '&Ocirc;' : '&#x00D4;',
        '&mdash;' : '&#x2014;',
        '&aring;' : '&#x00E5;',
        '&frac12;' : '&#x00BD;',
        '&Ograve;' : '&#x00D2;',
        '&szlig;' : '&#x00DF;',
        '&ccedil;' : '&#x00E7;',
        '&Uuml;' : '&#x00DC;',
        '&Acirc;' : '&#x00C2;',
        '&brvbar;' : '&#x00A6;',
        '&commat;' : "",
        '&lE;' : "",
        '&mgr;' : "",
        '&angst;' : "A",
        '&ohgr;' : "",
        '&Dgr;' : ""
    }


    # Pass 1: replace rare html entities not handled by the other passes
    for key, value in replacement_dict.items():
        line = line.replace(key, value)

    # Pass 2: resolve the named entities that htmlentitydefs knows about,
    # skipping the XML-predefined ones, which must stay escaped
    xml_predefined = ('amp', 'lt', 'gt', 'quot', 'apos')
    line = re.sub('&(%s);' % '|'.join(n for n in name2codepoint
                                      if n not in xml_predefined),
            lambda m: unichr(name2codepoint[m.group(1)]), line)

    # Pass 3: encode remaining non-ASCII characters as numeric character
    # references, which ElementTree accepts
    line = line.encode('ascii', 'xmlcharrefreplace')

    # Pass 4: finally, use a regex to strip anything that still looks
    # like an html entity
    line = re.sub(r"&[A-Za-z0-9]+;", "", line)
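
For context, here is a rough sketch of how these pieces can be wired together and fed to ElementTree; the sanitize wrapper, the abbreviated dictionary, and the file name are illustrative (my actual script differs), but the four passes are the ones shown above:

    import re
    import xml.etree.ElementTree as ET
    from htmlentitydefs import name2codepoint

    # abbreviated stand-in for the full scraped dictionary above
    replacement_dict = {'&lE;': '', '&angst;': 'A'}

    XML_PREDEFINED = ('amp', 'lt', 'gt', 'quot', 'apos')
    NAMED = re.compile('&(%s);' % '|'.join(
        n for n in name2codepoint if n not in XML_PREDEFINED))
    LEFTOVER = re.compile(r'&[A-Za-z0-9]+;')

    def sanitize(line):
        # pass 1: hand-built replacements for rare/erroneous entities
        for key, value in replacement_dict.items():
            line = line.replace(key, value)
        # pass 2: known HTML 4 named entities -> characters
        line = NAMED.sub(lambda m: unichr(name2codepoint[m.group(1)]), line)
        # pass 3: non-ASCII characters -> numeric character references
        line = line.encode('ascii', 'xmlcharrefreplace')
        # pass 4: drop whatever still looks like an entity
        return LEFTOVER.sub('', line)

    # hypothetical usage: clean every line, then parse the whole document
    with open('us-grant.xml') as f:    # illustrative file name
        cleaned = ''.join(sanitize(line) for line in f)
    root = ET.fromstring(cleaned)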