Is there a comprehensive way to find HTML entities (including foreign-language characters) and convert them to hexadecimal character references, or to another encoding that ElementTree accepts? Is there a best practice for this?
I'm parsing a large XML data set that uses HTML entities to encode Unicode and special characters. My script reads the XML file line by line. When I parse the data with Python's ElementTree, I get the following error:
ParseError: undefined entity: line 296, column 29
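For context, the failure is easy to reproduce: ElementTree only understands the five predefined XML entities, so any HTML-only named entity triggers this error (Python 3 syntax shown; Python 2's xml.etree behaves the same way):

```python
import xml.etree.ElementTree as ET

# ElementTree knows only the XML-predefined entities
# (&amp; &lt; &gt; &quot; &apos;), so an HTML-only entity
# such as &trade; raises ParseError: undefined entity.
try:
    ET.fromstring('<root>Brand&trade;</root>')
except ET.ParseError as e:
    print('ParseError:', e)
```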
I started by building a dictionary to find entities in each line and encode them as hexadecimal character references. This has alleviated many of the errors; for example, it converts the trademark symbol '&trade;' to '&#x2122;'. However, there is no end in sight, because I keep finding entities for foreign-language characters such as '&Aring;' and '&ouml;'. I have looked at several options and describe them below.
xmlcharrefreplace: This did not convert the foreign-language HTML entity names.
line = line.encode('ascii', 'xmlcharrefreplace')
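As far as I can tell, this is because xmlcharrefreplace only rewrites actual non-ASCII characters during encoding; a named entity like '&trade;' is already plain ASCII text, so it passes through unchanged (Python 3 output shown):

```python
# xmlcharrefreplace numeric-escapes real non-ASCII *characters*,
# but a named entity is already ASCII and survives untouched,
# so it still breaks the XML parser later.
line = u'caf\xe9 &trade;'
print(line.encode('ascii', 'xmlcharrefreplace'))
# b'caf&#233; &trade;'  -> the entity name is not converted
```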
HTMLParser.unescape(): Did not work, I believe because XML needs some characters kept escaped, such as '<', '&', and '>'.
h = HTMLParser.HTMLParser()
line = h.unescape(line)
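A quick check seems to confirm this: unescaping decodes every entity, including the XML-significant ones, so '&lt;' becomes a literal '<' and the line is no longer well-formed XML (Python 3's html.unescape shown, which is the successor to HTMLParser.unescape):

```python
from html import unescape  # Python 2 equivalent: HTMLParser().unescape

# unescape() decodes *all* entities, including &lt; and &amp;,
# so the output is no longer well-formed XML.
print(unescape('<p>5 &lt; 6 &amp; Brand&trade;</p>'))
# <p>5 < 6 & Brand™</p>
```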
Encoding to UTF-8: Did not work, I believe because XML still needs some characters escaped.
line = line.encode('utf-8')
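It also appears that encoding only changes the byte representation of characters already in the string; an entity reference is just ASCII text, so it comes through the encoder unchanged:

```python
# UTF-8 encoding re-encodes characters, but an entity reference
# is plain ASCII text and is left exactly as it was.
print(u'Brand&trade;'.encode('utf-8'))
# b'Brand&trade;'  -> still an undefined entity for ElementTree
```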
BeautifulSoup: This returned a BeautifulSoup object, and converting it back to a string added an XML version tag to each line; even after stripping that, there were other character additions.
line = BeautifulSoup(line, "xml")
line = str(line).replace('<?xml version="1.0" encoding="utf-8"?>', "").replace("\n", "")
htmlentitydefs: Still misses many characters; for example, it still missed '?' and '='. However, this got me further than the other options.
import re
from htmlentitydefs import name2codepoint  # Python 2; html.entities in Python 3

line = re.sub('&(%s);' % '|'.join(name2codepoint),
              lambda m: unichr(name2codepoint[m.group(1)]), line)
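To show where I am headed, here is a fuller sketch of the name2codepoint idea: replace every named HTML entity with a hexadecimal character reference, but leave the five XML-predefined entities alone so the result stays well-formed. The function name sanitize is just my placeholder; the import fallback covers both Python 2 and 3:

```python
import re

try:
    from html.entities import name2codepoint   # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

# The five entities XML itself defines must NOT be decoded,
# or the line stops being well-formed XML.
XML_PREDEFINED = {'amp', 'lt', 'gt', 'quot', 'apos'}

def entity_to_charref(match):
    name = match.group(1)
    if name in XML_PREDEFINED or name not in name2codepoint:
        return match.group(0)               # keep untouched
    return '&#x%X;' % name2codepoint[name]  # e.g. &trade; -> &#x2122;

def sanitize(line):
    # Rewrite every named entity as a numeric (hex) character reference.
    return re.sub(r'&(\w+);', entity_to_charref, line)

print(sanitize('<p>Brand&trade; 5 &lt; 6 &Aring;ngstr&ouml;m</p>'))
# <p>Brand&#x2122; 5 &lt; 6 &#xC5;ngstr&#xF6;m</p>
```

With this, ElementTree parses the sanitized line without complaining, since numeric character references are always legal in XML. I do not know yet whether name2codepoint covers every entity in my data set, which is really what my question is about.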