What I need is a way to use the html5lib parser to generate a real xml.etree.ElementTree. (lxml is not an option for portability reasons.)
ElementTree.parse can take a parser as an optional parameter:
xml.etree.ElementTree.parse(source, parser=None)
But it's not clear what such a parser would look like. Is there a class or object within html5lib that I could use for the parser argument? Documentation for both libraries on this issue is thin.
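For reference, the standard library's own parser can be passed explicitly, which at least shows the shape of object the parameter expects. This is a minimal sketch using only xml.etree.ElementTree; the SAMPLE string is made up for illustration, and whether html5lib offers anything compatible is exactly the open question.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical well-formed input, used only to illustrate the call.
SAMPLE = '<root><child>text</child></root>'

# XMLParser is the parser ElementTree.parse falls back to when parser=None;
# passing one explicitly shows what the `parser` argument looks like.
parser = ET.XMLParser(target=ET.TreeBuilder())
tree = ET.parse(io.StringIO(SAMPLE), parser=parser)
root = tree.getroot()
print(root.tag, root[0].text)
```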
Context:
I have a malformed XHTML file that can't be parsed with ElementTree.parse:
<?xml version="1.0" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Title</title></head>
<body><div class="cls">Note that this br<br>is missing a closing slash</div></body>
</html>
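The failure can be reproduced with the standard library alone. The sketch below feeds the document above to ElementTree as a string (the DOC and error names are just for illustration) and catches the resulting ParseError; the unclosed br tag is what I believe trips the parser.

```python
import xml.etree.ElementTree as ET

# The malformed XHTML from above, inlined as a string.
DOC = '''<?xml version="1.0" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Title</title></head>
<body><div class="cls">Note that this br<br>is missing a closing slash</div></body>
</html>'''

error = None
try:
    ET.fromstring(DOC)
except ET.ParseError as exc:
    # The strict XML parser rejects the document instead of repairing it.
    error = exc
    print("parse failed:", exc)
```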
So I used html5lib.parse instead, with the default treebuilder="etree" parameter, which worked fine.
But html5lib apparently does not output an xml.etree.ElementTree object, just one with a near-identical API. There are two problems with this:
- html5lib's find does not support the namespaces parameter, making XPath excessively verbose without a clumsy wrapper function.
- The Eclipse debugger does not support drill-through of html5lib etrees.
So I cannot use either ElementTree or html5lib alone.
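To make the namespaces point concrete, this is the behaviour I get from the standard library's find and would like to keep. It is a small sketch on a well-formed XHTML fragment; the fragment, the NS mapping and the 'x' prefix are arbitrary choices for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed fragment in the XHTML namespace.
XHTML = ('<html xmlns="http://www.w3.org/1999/xhtml">'
         '<body><div class="cls">hi</div></body></html>')
NS = {'x': 'http://www.w3.org/1999/xhtml'}  # the prefix 'x' is arbitrary

root = ET.fromstring(XHTML)

# With namespaces=, a short prefix stands in for the full URI.
short = root.find('.//x:div', namespaces=NS)

# Without it, every step has to spell out the URI in {uri}tag form.
verbose = root.find('.//{http://www.w3.org/1999/xhtml}div')

print(short.get('class'), verbose.get('class'))
```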