I would like to efficiently parse large HTML documents in Python. I am aware of Liza Daly's fast_iter and the similar iterparse concept in Python's own cElementTree. However, neither of these handles broken XML, and HTML effectively reads as broken XML; on top of that, my documents may contain other malformed XML.
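For reference, this is the kind of incremental pattern I mean. A minimal sketch of the fast_iter idea with lxml (the file name and the "record" tag are placeholders for my actual data):

```python
import lxml.etree as etree

def fast_iter(context, func):
    # Process each element as soon as it completes, then clear it and
    # its already-processed siblings so memory use stays flat.
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

context = etree.iterparse("huge.xml", events=("end",), tag="record")
fast_iter(context, lambda elem: print(elem.text))
```

This works beautifully on well-formed XML, but it raises on the first markup error.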
Similarly, I'm aware of answers like this one, which suggest not using any form of iterparse at all, and that is, in fact, what I'm doing now. However, I am trying to optimize past the biggest bottleneck in my program, which is parsing the documents.
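My current approach is roughly this (a sketch; the file name and the tags I pull out are placeholders), which tolerates broken markup but has to build the whole tree first:

```python
import lxml.html

# Full-tree parse: lxml.html recovers from broken markup, but the
# entire document is materialized in memory before any work starts,
# and this parse step dominates my runtime.
tree = lxml.html.parse("page.html")
for elem in tree.getroot().iter("a"):
    print(elem.get("href"))
```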
Furthermore, I've done a little experimentation with the SAX-style target interface for lxml parsers. I'm not sure what's going on, but it outright crashes Python: not merely an exception, but a "python.exe has stopped working" popup. I have no idea what's happening there, and I'm not even sure this method is actually faster than the standard parser, because I've seen very little about it on the Internet.
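Here is roughly the shape of what I was trying (a simplified sketch; the Collector class and file name are invented for illustration):

```python
import lxml.etree as etree

class Collector:
    # Minimal SAX-style target: lxml calls these methods as it
    # parses, so no tree is ever built.
    def __init__(self):
        self.links = []
    def start(self, tag, attrib):
        if tag == "a" and "href" in attrib:
            self.links.append(attrib["href"])
    def end(self, tag):
        pass
    def data(self, text):
        pass
    def close(self):
        return self.links

parser = etree.HTMLParser(target=Collector())
# With a target parser, parse() returns whatever close() returns.
links = etree.parse("page.html", parser)
```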
As such, my question is: is there anything similar to iterparse that lets me parse through a document quickly and efficiently, but doesn't throw a fit when the document isn't well-formed XML (i.e., it recovers from malformed markup)?
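In other words, something with roughly this shape. I don't know whether lxml's iterparse actually accepts any HTML/recovery mode, so treat the flag below as wishful thinking rather than a known API:

```python
import lxml.etree as etree

# Hypothetical: an incremental parse that recovers from bad markup
# instead of raising on the first error.
for event, elem in etree.iterparse("page.html", html=True):  # speculative flag
    print(event, elem.tag)
    elem.clear()
```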