38

There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing.

I've found plenty of great third-party libraries for this task, but this question is about the Python standard library.

Requirements:

  • Use only Python standard library components (any 2.x version)
  • DOM support
  • Handle HTML entities (&nbsp;)
  • Handle partial documents (like: Hello, <i>World</i>!)

Bonus points:

  • XPATH support
  • Handle unclosed/malformed tags. (<big>does anyone here know <html ???

Here's my 90% solution, as requested. This works for the limited set of HTML I've tried, but as everyone can plainly see, it isn't exactly robust. Since I produced it by staring at the docs for 15 minutes and writing one line of code, I thought I could consult the stackoverflow community for a similar but better solution...

from xml.etree.ElementTree import fromstring
# 'html' holds the page source; &nbsp; is rewritten as a numeric character
# reference because the XML parser only knows the five predefined XML entities.
DOM = fromstring("<html>%s</html>" % html.replace('&nbsp;', '&#160;'))
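To illustrate why the replace is there: ElementTree's parser is an XML parser, so it only understands the five predefined XML entities, and a named HTML entity like `&nbsp;` raises a ParseError. A minimal sketch of the failure and the workaround (runs unchanged on Python 3; on 2.7 the exception has the same name):

```python
from xml.etree.ElementTree import fromstring, ParseError

html = "Hello,&nbsp;<i>World</i>!"  # a partial document containing an HTML entity

try:
    fromstring("<html>%s</html>" % html)
except ParseError:
    # only &amp; &lt; &gt; &quot; &apos; are predefined in XML
    print("undefined entity")

# Workaround: rewrite the named entity as a numeric character reference,
# which any XML parser accepts.
tree = fromstring("<html>%s</html>" % html.replace('&nbsp;', '&#160;'))
print(tree.find('i').text)  # World
```

This obviously scales badly: every named entity that might appear in the input needs its own replace.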
bukzor
  • I'm not sure, but I would think http://docs.python.org/library/markup.html is an exhaustive list of all the standard library *ML functionality. – Nick T Apr 20 '10 at 17:03
  • I don't get it. Are you expecting us to do what? You know that there is no such module in stdlib. What is your question? – SilentGhost Apr 20 '10 at 17:04
  • Could someone explain why I'm getting downrated? This is a legitimate question that isn't currently addressed. We could all benefit from being able to do this task without requiring a third-party library. – bukzor Apr 20 '10 at 17:04
  • @bukzor: I think you're misunderstanding the idea behind the stdlib. – SilentGhost Apr 20 '10 at 17:06
  • @bukzor, Your claim "We could all benefit from being able to do this task without requiring a third-party library." does not strike me as true. – Mike Graham Apr 20 '10 at 17:06
  • @SilentGhost there are about 10 libraries that take me 90% of the way there. I was hoping someone here had already dealt with the last bit. For example, ElementTree, which is in the standard library, has a TidyHTMLTreeBuilder for parsing arbitrary HTML, but this wasn't included in the standard library. Was this because the same functionality is elsewhere in the stdlib? How can I know without asking. – bukzor Apr 20 '10 at 17:16
  • @Mr. Graham: At least in my borough of the scriptsphere, it's extremely useful to be able to email a script that will 'just work' without external dependencies. – bukzor Apr 20 '10 at 17:20
  • @Nick T: Thanks for that. The large number of libraries is part of the problem. I don't know which one might be able to do what I need. I've added that link to the question. – bukzor Apr 20 '10 at 17:21
  • @bukzor: If you can get 90% of the way there with std. libs, point out some explicit examples of what you are unable to do. If you work somewhere where you can easily pass along Python scripts, your audience shouldn't fret too much at the 15 seconds it takes to install a nicely packaged library, especially if you have it downloaded to your intranet and provide a handy-dandy link in the email. If you're a sysadmin, maybe repackage a bunch of useful ones and push them out? – Nick T Apr 20 '10 at 17:28
  • @SilentGhost: A common Python motto is 'batteries included', meaning that you should be able to do most tasks using the stdlib. Maybe HTML DOM is not one of those things. That's what this question is trying to clarify. – bukzor Apr 20 '10 at 17:34
  • @bukzor: As mikerobi pointed out, the BeautifulSoup source is really small, so if you really want a single-file script with no 3P dependencies, copy-paste sounds like your best bet, and just skip trying to stitch together some stdlibs. – Nick T Apr 20 '10 at 17:35
  • An old question, but it's gotta be said: that last 10% is still 90% (or more) of the work. – mc0e Jul 01 '14 at 14:28

6 Answers

47

Parsing HTML reliably is a relatively modern development (weird though that may seem). As a result there is definitely nothing in the standard library. HTMLParser may appear to be a way to handle HTML, but it's not -- it fails on lots of very common HTML, and though you can work around those failures there will always be another case you haven't thought of (if you actually succeed at handling every failure you'll have basically recreated BeautifulSoup).

There are really only 3 reasonable ways to parse HTML (as it is found on the web): lxml.html, BeautifulSoup, and html5lib. lxml is the fastest by far, but can be a bit tricky to install (and impossible in an environment like App Engine). html5lib is based on how HTML 5 specifies parsing; though similar in practice to the other two, it is perhaps more "correct" in how it parses broken HTML (they all parse pretty-good HTML the same). They all do a respectable job at parsing broken HTML. BeautifulSoup can be convenient though I find its API unnecessarily quirky.
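For what it's worth, here is roughly the shape of what the stdlib does offer: HTMLParser is an event-driven tokenizer that hands you callbacks rather than a tree, so DOM assembly and error recovery are entirely your problem. A minimal sketch (Python 3 module path shown; in 2.x it's `from HTMLParser import HTMLParser`):

```python
from html.parser import HTMLParser  # Python 2.x: from HTMLParser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect character data, ignoring tags; no tree, no error recovery."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

parser = TextExtractor()
parser.feed("Hello, <i>World</i>!")  # a partial document parses fine
print(''.join(parser.chunks))  # Hello, World!
```

It happily accepts a partial document, but a tree, an entity table, and recovery from malformed tags are all still missing, which is exactly the gap BeautifulSoup and friends fill.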

Ian Bicking
  • Great answer. Thanks! I don't have enough rep to uprate you. QQ I wish people weren't so touchy about hard questions. The good scientist seeks negative experiments as well. – bukzor Apr 21 '10 at 06:24
  • @Ian Bicking: finally got enough rep to bump you. Just to confirm, there's no known way to get ElementTree (as it exists in the stdlib) to parse real-world HTML? – bukzor Apr 21 '10 at 17:04
  • You can have BeautifulSoup (with ElementSoup) or html5lib parse the HTML and generate an ElementTree structure, but ElementTree itself definitely cannot parse HTML. – Ian Bicking Apr 22 '10 at 19:31
  • With some finagling and a little bit of HTML-correction, I've gotten ElementTree to parse all of RosettaCode.org. The most annoying part is adding all the html entities to the parser by hand. There's even an option for this in the etree docs, but it's unimplemented for undocumented reasons. You can see the code here: http://bukzor.hopto.org/svn/software/python/rosetta_pylint.py – bukzor Apr 25 '10 at 20:48
5

Take the source code of BeautifulSoup and copy it into your script ;-) I'm only sort of kidding... anything you could write that would do the job would more or less be duplicating the functionality that already exists in libraries like that.

If that's really not going to work, I have to ask, why is it so important that you only use standard library components?

David Z
  • It's not so important. It's simply my question. As I said, there is a ton of HTML and XML support in the Python library. It seems like something there should support this. If not, that's an answer too, but I'm not convinced yet. – bukzor Apr 20 '10 at 16:58
  • Note that BeautifulSoup is no longer being maintained. I prefer lxml.html myself. Overall, this is a great answer. – Mike Graham Apr 20 '10 at 17:09
  • Where did you hear that? The BeautifulSoup website shows no evidence that it is no longer being maintained. In fact the most recent release was 11 days ago. (Of course, any other third-party HTML parser works just as well for the argument I was making in the answer) – David Z Apr 20 '10 at 17:25
  • Maybe he was thinking BS 3.0 was only for Python 3.x? Their site indicates BS 3.0 is for Py 2.3-2.6, and BS 3.1 is for Py 3.x (though ironically the last BS 3.1 release is about a year old, versus a couple weeks for BS 3.0) – Nick T Apr 20 '10 at 17:42
  • @David, Richardson has said multiple times that he is trying his best to quit BS development, though it seems he does still do a little. See e.g. http://www.crummy.com/software/BeautifulSoup/3.1-problems.html – Mike Graham Apr 20 '10 at 17:47
  • @Mike Graham: Under that link I see this: "... you can use Element Soup to feed the HTML into Beautiful Soup once ElementTree has cleaned it up." Can anyone expand what he means by that? How do you clean up HTML with ElementTree? – bukzor Apr 21 '10 at 04:45
  • @bukzor, (It seems a bit odd to ask me about stuff found on a page I presented about why not to use a piece of software.) In any event, as I understand the element tree API, you would call `ElementSoup.parse(some_file).write(some_new_place)` to parse an HTML file then write the tree you got after reconciling everything less than kosher about it. http://effbot.org/zone/element-index.htm#documentation provides some information about ElementTree in its various incarnations (which include this and other HTML parsers). Feel free to open a question for a more complete answer. – Mike Graham Apr 21 '10 at 05:08
  • @Mike Graham: I just noticed that the quote said ElementSoup, not ElementTree. I was asking about it because it seemed to imply that I could use ElementTree independent of BeautifulSoup for HTML "cleaning". – bukzor Apr 21 '10 at 05:36
  • @bukzor, Cleaning HTML is the topic of another question. The snippet I provide should be the essence of doing it with an ElementTree HTML parser. I don't understand what you're referring to with "the only reference to html seems to be a side project that is unmaintained since 2007". If you're talking about the ElementTree docs I linked to, stuff not applying to HTML directly is relevant if you're interested in an ElementTree-based HTML parser, since the API is independent of the exact format being parsed/generated using ElementTree. – Mike Graham Apr 21 '10 at 05:39
  • @bukzor, ElementSoup is an implementation of ElementTree using BeautifulSoup for parsing. ElementTree is an API with many implementations for parsing XML and HTML. – Mike Graham Apr 21 '10 at 06:19
  • @Mike Graham: Thanks. I'm inferring that any HTML parsers implemented with ElementTree are not included in the stdlib. Do you know of a better-maintained etree-html parser than esoup? – bukzor Apr 21 '10 at 17:08
  • @bukzor, There are no general-purpose, robust HTML parsers of any kind in the stdlib. `lxml.html`, which I have mentioned several places, provides an extended ElementTree API. `html5lib`, which others have mentioned, is compatible with a number of APIs including multiple ElementTree implementations, as best I understand it. – Mike Graham Apr 21 '10 at 18:13
4

Your choices are to change your requirements or to duplicate all of the work done by the developers of third party modules.

Beautiful Soup consists of a single Python file with about 2000 lines of code. If that is too big a dependency, go ahead and write your own; it won't work as well and probably won't be a whole lot smaller.

mikerobi
  • If it's really that compact (never really bothered to look :P ) and he's hell-bent on having a script work without any other dependencies, copy-paste sounds a great plan. – Nick T Apr 20 '10 at 17:32
  • Literal copy-and-paste is a ridiculous way to add a dependency. – Mike Graham Apr 20 '10 at 17:38
1

Doesn't fit your requirement of the standard library only, but BeautifulSoup is nice.

PW.
  • That's one of the libraries that I referenced with this: "I've found plenty of great third-party libraries for this task, but this question is about the python standard library." – bukzor Apr 20 '10 at 16:41
1

I cannot think of any popular language with a good, robust, heuristic HTML-parsing library in its stdlib. Python certainly does not have one, which is something I think you know.

Why the requirement of a stdlib module? Most of the time when I hear people make that requirement, they are being silly. For most major tasks, you will need a third-party module or a whole lot of work re-implementing one. Introducing a dependency is a good thing, since that's work you didn't have to do.

So what you want is lxml.html. Ship lxml with your code if that's an issue, at which point it becomes functionally equivalent to writing it yourself except in difficulty, bugginess, and maintainability.

Mike Graham
  • From my research, I was seeing that as the most common answer, but I don't know, and I'm still not convinced that there's no such capability in the stdlib. You'll have to admit that a script that uses no external library is much more likely to work correctly for novice users. – bukzor Apr 20 '10 at 17:30
  • @bukzor, Well get convinced, since it's the case. =p And I do not have to admit that at all. ;) – Mike Graham Apr 20 '10 at 17:48
  • Parsing HTML is something people have only actually understood widely for a few years now; it's taken shockingly long. So it can be said quite definitively that there is nothing in the standard library: BeautifulSoup, html5lib, and lxml.html makes a complete list. – Ian Bicking Apr 20 '10 at 20:10
  • @Ian Bicking: If you'd make that an answer, I'd check it. Am I getting downrated simply because my answer is no? – bukzor Apr 21 '10 at 04:18
0

As already stated, there is currently no satisfying solution using only the standard library. I faced the same problem as you when I tried to run one of my programs in an outdated hosting environment with no possibility to install my own extensions and only Python 2.6 available. Solution:

Grab this file and the latest stable BeautifulSoup version of the 3.x series (3.2.1 as of now). From the tar-file there, pick only BeautifulSoup.py; it's the only file you really need to ship with your code. With these two files on your path, all you need to do to get an etree object from an HTML string, much as you would get from lxml, is this:

from StringIO import StringIO  # Python 2; this answer targets a Python 2.6 host
import ElementSoup  # the single-file ElementSoup.py shipped alongside the script

# input_str is the HTML string to parse; the result is an etree element.
tree = ElementSoup.parse(StringIO(input_str))

lxml itself and html5lib both require you to compile some C code to make them run. It is considerably more effort to get them working, and if your environment is restricted or your intended audience isn't willing to do that, avoid them.

Michael
  • html5lib has no extensions (e.g., C code) that it depends upon. It can *optionally* use several (such as `datrie`) to improve performance, but it will work fine without. – gsnedders Aug 04 '13 at 16:37