I would like to efficiently parse large HTML documents in Python. I am aware of Liza Daly's fast_iter and the similar iterparse concept in Python's own cElementTree. However, neither of these handles broken XML, and HTML effectively reads as broken XML; on top of that, my documents may contain other malformed XML.
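For reference, this is the kind of incremental pattern I mean. A minimal sketch of the fast_iter idea with lxml (the file name and the "record" tag are placeholders for my actual data):

```python
import lxml.etree as etree

def fast_iter(context, func):
    # Process each element as soon as it completes, then clear it and
    # its already-processed siblings so memory use stays flat.
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

context = etree.iterparse("huge.xml", events=("end",), tag="record")
fast_iter(context, lambda elem: print(elem.text))
```

This works beautifully on well-formed XML, but it raises on the first markup error.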
Similarly, I'm aware of answers like this one, which suggest not using any form of iterparse at all, and that is, in fact, what I'm doing now. However, I am trying to optimize past the biggest bottleneck in my program, which is parsing the documents.
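My current approach is roughly this (a sketch; the file name and the tags I pull out are placeholders), which tolerates broken markup but has to build the whole tree first:

```python
import lxml.html

# Full-tree parse: lxml.html recovers from broken markup, but the
# entire document is materialized in memory before any work starts,
# and this parse step dominates my runtime.
tree = lxml.html.parse("page.html")
for elem in tree.getroot().iter("a"):
    print(elem.get("href"))
```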
Furthermore, I've done a little experimentation with the SAX-style target interface for lxml parsers. I'm not sure what's going on, but it outright crashes Python: not merely an exception, but a "python.exe has stopped working" popup. I have no idea what's happening there, and I'm not even sure this method is actually faster than the standard parser, because I've seen very little about it on the Internet.
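Here is roughly the shape of what I was trying (a simplified sketch; the Collector class and file name are invented for illustration):

```python
import lxml.etree as etree

class Collector:
    # Minimal SAX-style target: lxml calls these methods as it
    # parses, so no tree is ever built.
    def __init__(self):
        self.links = []
    def start(self, tag, attrib):
        if tag == "a" and "href" in attrib:
            self.links.append(attrib["href"])
    def end(self, tag):
        pass
    def data(self, text):
        pass
    def close(self):
        return self.links

parser = etree.HTMLParser(target=Collector())
# With a target parser, parse() returns whatever close() returns.
links = etree.parse("page.html", parser)
```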
As such, my question is: is there anything similar to iterparse that lets me parse through a document quickly and efficiently, but doesn't throw a fit when the document isn't well-formed XML (i.e., it recovers from malformed markup)?
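In other words, something with roughly this shape. I don't know whether lxml's iterparse actually accepts any HTML/recovery mode, so treat the flag below as wishful thinking rather than a known API:

```python
import lxml.etree as etree

# Hypothetical: an incremental parse that recovers from bad markup
# instead of raising on the first error.
for event, elem in etree.iterparse("page.html", html=True):  # speculative flag
    print(event, elem.tag)
    elem.clear()
```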