
Consider the following snippet:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>&copy;</title></head>
  <body></body>
</html>

The W3C validator (https://validator.w3.org/) accepts it as valid XHTML 1.0 Transitional. However, ElementTree in Python 3.7 fails to parse it:

$ python -c 'from xml.etree import ElementTree as ET; ET.parse("foo.html")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.7/xml/etree/ElementTree.py", line 1197, in parse
    tree.parse(source, parser)
  File "/usr/lib/python3.7/xml/etree/ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity &copy;: line 4, column 15

Note that &copy; is indeed an entity defined (ultimately) in xhtml-lat1.ent.

Is there a way to parse such documents using ElementTree? An answer to a similar question suggested manually prepending the appropriate XML entity definitions to the HTML content (e.g. <!ENTITY nbsp ' '>), but that is not really a general solution unless one prepends a header with all the definitions to every document. It seems like there should be something simpler.

Thanks in advance.

antony
  • Stupid question probably, but is "foo.html" the name of the file you're parsing? That would be HTML, not XHTML; doesn't that throw the parser off? – Mr Lister Aug 20 '18 at 19:48
  • That doesn't matter (rename it as xhtml if you want, error stays). – antony Aug 20 '18 at 21:36

2 Answers


Have you considered lxml?

from lxml import html


root = html.fromstring("""
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>&copy;</title></head>
  <body></body>
</html>
""".strip())
print(root.head.getchildren()[0].text)
# '©'

&copy; is not one of XML's five predefined entities, and ElementTree does not load the external DTD that declares it. The xml package parses XML, not HTML. The built-in HTML parser, on the other hand, can parse this content:

from html.parser import HTMLParser


parser = HTMLParser()
parser.feed("""
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>&copy;</title></head>
  <body></body>
</html>
""".strip())
# no error

But its API is rather awkward to use. lxml, by contrast, provides an ElementTree-like API.
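To give an idea of what working with html.parser looks like, here is a minimal sketch that pulls the title text out of the document. The class name TitleExtractor and the variable names are made up for illustration; the only library behaviour relied on is that HTMLParser converts character references in text by default (convert_charrefs=True).

```python
from html.parser import HTMLParser

DOC = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>&copy;</title></head>
  <body></body>
</html>"""


class TitleExtractor(HTMLParser):
    """Hypothetical helper: collects the text found inside <title>."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Entities such as &copy; arrive here already converted to characters.
        if self.in_title:
            self.title_parts.append(data)


extractor = TitleExtractor()
extractor.feed(DOC)
extractor.close()
title = "".join(extractor.title_parts)
print(title)
```

Every query needs its own bit of state tracking like this, which is why a tree API tends to be more comfortable for anything beyond simple extraction.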

Sraw
  • It would seem like there should be a builtin solution, but sure, lxml is good enough. – antony Aug 21 '18 at 07:56
  • @antony Well, if you can accept using `html.parser.HTMLParser`, you can indeed use a builtin solution. – Sraw Aug 21 '18 at 07:59
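A stdlib-only possibility that stays with ElementTree, offered as a sketch rather than a confirmed recipe: XMLParser instances carry an entity mapping that is consulted when an entity is not declared in the document, and it can be filled from html.entities.name2codepoint. This assumes the document has a DOCTYPE that references an external DTD (as in the question), since that appears to be what routes undeclared entities to the lookup instead of failing immediately. The entity attribute is not prominently documented, so treat its behaviour as an assumption to verify on your Python version.

```python
from html.entities import name2codepoint
from xml.etree import ElementTree as ET

XHTML_DOC = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>&copy;</title></head>
  <body></body>
</html>"""

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

parser = ET.XMLParser()
# Assumption: parser.entity maps entity names (without & and ;) to
# replacement text. Update the existing dict in place rather than rebinding it.
parser.entity.update((name, chr(cp)) for name, cp in name2codepoint.items())

root = ET.fromstring(XHTML_DOC, parser=parser)
title = root.find(XHTML_NS + "head/" + XHTML_NS + "title")
print(title.text)
```

A parser object can only be used for one document, so a new XMLParser has to be created and populated for each parse.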

In Python 3.4+ you can use html.unescape to convert HTML5 entity references into the corresponding Unicode characters. After that, any XML parser works.

import re
from html import escape, unescape
# Swap each named entity for its character; escape() keeps &amp;, &lt;, &gt; well-formed.
textXML = re.sub(r"&\w+;", lambda x: escape(unescape(x.group(0))), text)
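Put together with ElementTree, the approach might look like the following sketch. The function name replace_named_entities and the sample title text are illustrative, not part of the original answer.

```python
import re
from html import escape, unescape
from xml.etree import ElementTree as ET


def replace_named_entities(text):
    """Hypothetical helper: turn named HTML entities into literal characters.

    escape() re-escapes &, < and > so that references such as &amp;
    stay well-formed after the round trip.
    """
    return re.sub(r"&\w+;", lambda m: escape(unescape(m.group(0))), text)


doc = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>&copy; A &amp; B</title></head>
  <body></body>
</html>"""

cleaned = replace_named_entities(doc)
root = ET.fromstring(cleaned)

ns = {"x": "http://www.w3.org/1999/xhtml"}
title_text = root.find("x:head/x:title", ns).text
print(title_text)
```

Two caveats: a name that html.unescape does not recognise ends up as literal text (for example &foo; becomes &amp;foo;), and the regular expression is applied to the whole document, so it also rewrites matches inside CDATA sections or comments.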
Guido U. Draheim