Parsing XHTML including standard entities using ElementTree

Question

Consider the following snippet:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>&copy;</title></head>
  <body></body>
</html>

It is deemed valid XHTML 1.0 Transitional per W3C's validator (https://validator.w3.org/). However, Python (3.7)'s ElementTree chokes on it with

$ python -c 'from xml.etree import ElementTree as ET; ET.parse("foo.html")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.7/xml/etree/ElementTree.py", line 1197, in parse
    tree.parse(source, parser)
  File "/usr/lib/python3.7/xml/etree/ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity &copy;: line 4, column 15

Note that &copy; is indeed an entity defined (ultimately) in xhtml-lat1.ent.

Is there a way to parse such documents using ElementTree? An answer to a similar question suggested manually prepending the appropriate XML entity definitions to the HTML content (e.g. <!ENTITY nbsp ' '>), but that's not really a general solution, unless one prepends a header with all the definitions to every document. It seems like there should be something simpler?

Thanks in advance.

Stupid question probably, but is "foo.html" the name of the file you're parsing? That would be HTML, not XHTML; doesn't that throw the parser off? — Mr Lister, Aug 20 '18 at 19:48
That doesn't matter (rename it as xhtml if you want, error stays). — antony, Aug 20 '18 at 21:36

score 0 · Accepted Answer · answered Aug 20 '18 at 14:12

Have you considered lxml?

from lxml import html


root = html.fromstring("""
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>&copy;</title></head>
  <body></body>
</html>
""".strip())
# getchildren() is deprecated in lxml; index the element directly instead.
print(root.head[0].text)
# ©

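&copy; is not a predefined entity in XML (XML itself only defines &lt;, &gt;, &amp;, &apos; and &quot;), and ElementTree does not load the external DTD that declares it. The xml package really parses XML, not HTML. The built-in HTML parser, on the other hand, can parse this content: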

from html.parser import HTMLParser


parser = HTMLParser()
parser.feed("""
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>&copy;</title></head>
  <body></body>
</html>
""".strip())
# no error

But its API is really awkward to use, whereas lxml provides an ElementTree-like API.
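To give an idea of what working with html.parser involves, here is a minimal sketch of a subclass that collects the text inside <title>. The class name TitleCollector and the sample input are made up for illustration; the handler methods and the convert_charrefs flag are the standard HTMLParser ones.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Hypothetical helper: gathers the text found inside <title>."""

    def __init__(self):
        # convert_charrefs=True (the default since Python 3.5) turns
        # references such as &copy; into characters before handle_data.
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.parts.append(data)


collector = TitleCollector()
collector.feed(
    "<html><head><title>&copy; 2018</title></head><body></body></html>"
)
collector.close()  # flush any buffered text
title = "".join(collector.parts)
print(title)
```

Every piece of information has to be tracked by hand in callbacks like these, which is why a tree API is usually more convenient.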

It would seem like there should be a builtin solution, but sure, lxml is good enough. — antony, Aug 21 '18 at 07:56
@antony Well, if you can accept using `html.parser.HTMLParser`, then you can indeed use a builtin solution. — Sraw, Aug 21 '18 at 07:59

score 0 · Answer 2 · answered May 15 '23 at 21:42

In Python 3.4+ you can use html.unescape to convert HTML5 named entity references into the corresponding Unicode characters, then re-escape the result so that characters such as & and < stay valid. After that, any XML parser works.

import re
from html import escape, unescape
# text holds the XHTML source; only named references such as &copy; are rewritten.
textXML = re.sub(r"&\w+;", lambda x: escape(unescape(x.group(0))), text)
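The substitution can be tried end to end with ElementTree. This is only a sketch: the sample document is the one from the question with two extra references added to the title, and the variable names are illustrative. Note that the regular expression is applied to the whole source, so it would also rewrite references inside comments or CDATA sections.

```python
import re
from html import escape, unescape
from xml.etree import ElementTree as ET

# Sample input: the question's document, with &amp; and &lt; added to
# show that XML's own entities survive the round trip.
text = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>&copy; &amp; &lt;</title></head>
  <body></body>
</html>"""

# Replace each named reference with its character, re-escaping the
# characters that must stay escaped in XML (&, <, >, quotes).
text_xml = re.sub(r"&\w+;", lambda m: escape(unescape(m.group(0))), text)

root = ET.fromstring(text_xml)
# Elements live in the XHTML namespace declared on <html>.
title = root.find(".//{http://www.w3.org/1999/xhtml}title")
print(title.text)
```

Numeric references such as &#169; are not matched by the pattern, which is fine because XML parsers already understand them.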
