I am working on a project that will involve parsing HTML.

After searching around, I found two likely options: BeautifulSoup and lxml.html.

Is there any reason to prefer one over the other? I have used lxml for XML some time back and I feel I will be more comfortable with it; however, BeautifulSoup seems to be much more common.

I know I should use the one that works for me, but I was looking for personal experiences with both.

– user225312

4 Answers

The simple answer, imo, is that if you trust your source to be well-formed, go with the lxml solution. Otherwise, BeautifulSoup all the way.

Edit:

This answer is three years old now; it's worth noting, as Jonathan Vanasco does in the comments, that BeautifulSoup4 now supports using lxml as the internal parser, so you can use the advanced features and interface of BeautifulSoup without most of the performance hit, if you wish (although I still reach straight for lxml myself -- perhaps it's just force of habit :)).
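For reference, here is a minimal sketch of that combination, assuming `beautifulsoup4` and `lxml` are both installed (the markup is invented):

    # BeautifulSoup's interface on top of lxml's parser: passing "lxml" as
    # the second argument tells bs4 to use it as the underlying tree builder.
    from bs4 import BeautifulSoup

    html = "<html><body><p class='intro'>Hello<p>world</body></html>"

    soup = BeautifulSoup(html, "lxml")

    # lxml quietly repairs the unclosed <p> tags; the bs4 API is unchanged.
    print(soup.find("p", class_="intro").get_text())  # -> Hello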

– simon
  • I see. I will go with lxml only, my HTML comes from a robust website so I can (hopefully) depend on it to be well formed. – user225312 Feb 11 '11 at 09:28
  • In my experience, lxml.html handles ill-formed html just fine. – Steven Feb 11 '11 at 09:52
  • @Steven: So you also recommend `lxml.html` over `BeautifulSoup`? – user225312 Feb 11 '11 at 09:56
  • Yes, I would, certainly if you are already familiar with lxml, and you have no "pure python" requirements (as on Google App Engine). Personally, I haven't had any problems processing pages with lxml.html (on the contrary, I have been able to process pages that gave problems with BeautifulSoup), except once when I had to explicitly provide the correct character encoding (because lxml "trusted" the incorrect HTTP headers/HTML meta tags). Also note that the [ElementSoup](http://codespeak.net/lxml/elementsoup.html) module enables lxml.html to use the BeautifulSoup parser should it be necessary. – Steven Feb 11 '11 at 10:35
  • @Steven: my own experience was not so good, but I'll credit yours next time I'm faced with a choice. +1 for mentioning ElementSoup, too. Finally, @Patrick: another argument in favour of lxml is better speed in almost every case. – simon Feb 11 '11 at 12:07
  • This question popped up because of a recent edit. I just wanted to note that `BeautifulSoup4` supports using `lxml` as the underlying parser -- so now you can basically get *almost* the speed of lxml (just a minor hit) with all the bonuses of BeautifulSoup. – Jonathan Vanasco Oct 23 '13 at 18:19

In summary, lxml is positioned as a lightning-fast, production-quality HTML and XML parser that, by the way, also includes a soupparser module to fall back on BeautifulSoup's functionality. BeautifulSoup is a one-person project, designed to save you time by quickly extracting data from poorly formed HTML or XML.

The lxml documentation says that both parsers have advantages and disadvantages. For this reason, lxml provides the soupparser module so you can switch back and forth. Quoting,

BeautifulSoup uses a different parsing approach. It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection. It very much depends on the input which parser works better.

In the end they are saying,

The downside of using this parser is that it is much slower than the HTML parser of lxml. So if performance matters, you might want to consider using soupparser only as a fallback for certain cases.

If I understand them correctly, it means that the soup parser is more robust --- it can deal with a "soup" of malformed tags by using regular expressions --- whereas lxml is more straightforward and just parses things and builds a tree as you would expect. I assume it also applies to BeautifulSoup itself, not just to the soupparser for lxml.
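To make the fallback idea concrete, here is a minimal sketch of that pattern, assuming both lxml and BeautifulSoup are installed (the markup and the error handling are illustrative):

    # Try lxml's fast native parser first; fall back to the much slower
    # soupparser (which delegates to BeautifulSoup) only when lxml gives up.
    import lxml.html
    from lxml.etree import ParserError
    from lxml.html import soupparser

    def parse_html(markup):
        try:
            return lxml.html.fromstring(markup)
        except ParserError:
            # Raised for e.g. empty or hopelessly mangled input.
            return soupparser.fromstring(markup)

    root = parse_html("<p>some <b>tag soup")
    print(lxml.html.tostring(root))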

They also show how to benefit from BeautifulSoup's encoding detection, while still parsing quickly with lxml:

>>> from BeautifulSoup import UnicodeDammit

>>> def decode_html(html_string):
...     converted = UnicodeDammit(html_string, isHTML=True)
...     if not converted.unicode:
...         raise ValueError(
...             "Failed to detect encoding, tried [%s]" %
...             ', '.join(converted.triedEncodings))
...     # print converted.originalEncoding
...     return converted.unicode

>>> import lxml.html
>>> root = lxml.html.fromstring(decode_html(tag_soup))

(Same source: http://lxml.de/elementsoup.html).

In the words of BeautifulSoup's creator,

That's it! Have fun! I wrote Beautiful Soup to save everybody time. Once you get used to it, you should be able to wrangle data out of poorly-designed websites in just a few minutes. Send me email if you have any comments, run into problems, or want me to know about your project that uses Beautiful Soup.

 --Leonard

Quoted from the Beautiful Soup documentation.

I hope this is now clear. Beautiful Soup is a brilliant one-person project designed to save you time extracting data from poorly designed websites. The goal is to save you time right now and get the job done, not necessarily to save you time in the long term, and definitely not to optimize the performance of your software.

Also, from the lxml website,

lxml has been downloaded from the Python Package Index more than two million times and is also available directly in many package distributions, e.g. for Linux or MacOS-X.

And, from Why lxml?,

The C libraries libxml2 and libxslt have huge benefits:... Standards-compliant... Full-featured... fast. fast! FAST! ... lxml is a new Python binding for libxml2 and libxslt...

– Sergey Orshanskiy

Use both? lxml for DOM manipulation, BeautifulSoup for parsing:

http://lxml.de/elementsoup.html
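A possible sketch of that division of labour, using the soupparser module from the link above (the markup and the "ad" class are invented):

    # BeautifulSoup does the parsing (via lxml.html.soupparser), and the
    # resulting tree is then manipulated through lxml's usual XPath/DOM API.
    from lxml.html import soupparser, tostring

    doc = soupparser.fromstring(
        "<body><div class='ad'>buy now!</div><p>keep this</p></body>")

    # DOM manipulation with lxml: drop every element whose class is "ad".
    for el in doc.xpath('//*[@class="ad"]'):
        el.getparent().remove(el)

    print(tostring(doc))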

– ymv
  • What do you mean by "parsing"? I ask this because, IMHO, parsing is just the same as performing operations on the DOM. – nn0p Aug 24 '15 at 05:05

lxml's great. But parsing your input as HTML is useful only if the DOM structure actually helps you find what you're looking for.

Can you use ordinary string functions or regexes? For a lot of html parsing tasks, treating your input as a string rather than an html document is, counterintuitively, way easier.
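To make that concrete, here is a deliberately naive sketch (the markup is invented, and the comments below explain why this approach is fragile):

    # Pull all href values out of a page with a plain regex. Quick to write,
    # but brittle: single quotes, unquoted attributes, or extra whitespace
    # around the = sign will all slip past this pattern.
    import re

    page = '<a href="/one">first</a> <a class="x" href="/two">second</a>'

    hrefs = re.findall(r'href="([^"]*)"', page)
    print(hrefs)  # -> ['/one', '/two']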

– dfichter
  • "easier", perhaps -- but not robust by any means. It's extremely easy for a formatting change in the input HTML (line wrapping, whitespace, element encoding, etc.) to break a manually-developed "parser". If you want to build something to parse input you don't control, or which otherwise might change in the future, using a real HTML parser is the Right Thing. – Charles Duffy Aug 30 '11 at 17:22
  • @dfichter You've done it again. You spake the unspeakable; you've uttered the unholy incantation by crossing html and regexes in the same breath. You've surely [wandered into the mouth of madness as so many poor souls before you](http://stackoverflow.com/a/1732454/462302). – aculich Jan 16 '12 at 21:26