
I'm looking for an HTML Parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects.

If I have a document of the form:

<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>

then it should give me a way to access the nested tags via the name or id of the HTML tag so that I can basically ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

If you've used Firefox's "Inspect element" feature (view HTML) you would know that it gives you all the tags in a nice nested manner like a tree.

I'd prefer a built-in module but that might be asking a little too much.
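For context, the closest built-in option, html.parser, is event-driven rather than tree-based, which is why most answers below reach for third-party libraries. A minimal sketch of the event-driven style (stdlib only):

```python
from html.parser import HTMLParser

class DivTextCollector(HTMLParser):
    """Collects text found inside <div> tags, illustrating the callback style."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div>s we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'div':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

parser = DivTextCollector()
parser.feed("<body><div class='container'>"
            "<div id='class'>Something here</div>"
            "<div>Something else</div></div></body>")
print(parser.chunks)  # ['Something here', 'Something else']
```

You have to track nesting yourself, which is exactly the bookkeeping the tree-based libraries do for you.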


I went through a lot of questions on Stack Overflow and a few blogs on the internet, and most of them suggest BeautifulSoup, lxml, or HTMLParser, but few of these detail the functionality; they simply end as a debate over which one is faster/more efficient.

  • Like all the other answerers, I would recommend BeautifulSoup because it is really good at handling broken HTML files. – Pascal Rosin Jul 29 '12 at 12:24

7 Answers


So that I can ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

try:
    from bs4 import BeautifulSoup
except ImportError:
    from BeautifulSoup import BeautifulSoup  # fall back to the legacy BeautifulSoup 3

html = "..."  # the HTML code you've written above
parsed_html = BeautifulSoup(html, 'html.parser')
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)

You probably don't need performance comparisons; just read how BeautifulSoup works and look at its official documentation.
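As a pointer, BeautifulSoup 4 also supports CSS selectors via select()/select_one(), which can be more concise than find() for nested lookups. A minimal sketch against the question's document (assuming bs4 is installed):

```python
from bs4 import BeautifulSoup

html = """<html><body attr1='val1'>
<div class='container'>
  <div id='class'>Something here</div>
  <div>Something else</div>
</div></body></html>"""

soup = BeautifulSoup(html, 'html.parser')
# CSS-selector equivalent of find('div', attrs={'class': 'container'})
container = soup.select_one('body div.container')
print(container.get_text(separator=' ', strip=True))
```

select_one() returns the first match (or None), while select() returns a list of all matches.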

  • What exactly is the parsed_html object? – ffledgling Jul 29 '12 at 12:21
  • parsed_html is a BeautifulSoup object; think of it like a DOMElement or DOMDocument, except it has "tricky" properties: e.g. "body" refers to the BeautifulSoup object (remember, it's basically a tree node) of the first (and, in this case, only) body element of the root element (in our case, html). – Aadaam Jul 29 '12 at 12:38
  • General info: if performance is critical, better use the `lxml` library instead (see answer below). With `cssselect` it's quite useful as well, and performance is often 10- to 100-fold better than the other libraries available. – Lenar Hoyt Nov 08 '14 at 01:04
  • note: `class` attribute is special: `BeautifulSoup(html).find('div', 'container').text` – jfs Mar 10 '16 at 17:17
  • @mcb: if `lxml` is installed; `BeautifulSoup` can use it to parse html. – jfs Mar 10 '16 at 17:18
  • @J.F.Sebastian I know; as far as I remember the BeautifulSoup front end makes it slow, but I might be wrong. Did you compare it? – Lenar Hoyt Mar 10 '16 at 17:22
  • @mcb I don't remember having performance issues with it. YMMV. You could try to pass `parse_only=SoupStrainer(*interesting_parts)` and see if it helps. – jfs Mar 10 '16 at 17:30
  • `parsed_html = BeautifulSoup(html)` doesn't work for me; `parsed_html = BeautifulSoup(html, 'html.parser')` does. – Pavel Mar 14 '17 at 12:11
  • @BaileyParker you'd think in a sea of people constantly dealing with python2, python3, c++11, c++17 Opencv 4.3, and Java 2021, someone would have ****ing thought of naming it `import bs` so when they change their minds with bs5 they don't break everyone's code yet again – Nathan majicvr.com Jun 10 '20 at 10:41
  • Remark: beautifulsoup supports css selector too, see https://stackoverflow.com/questions/24801548/how-to-use-css-selectors-to-retrieve-specific-links-lying-in-some-class-using-be – so it's not "less user friendly" ⟺ also "jquery-like". – user202729 Jan 24 '22 at 07:28
  • @Nathan To be fair, a major version update means a major incompatible change, so it's likely that the code would break one way or the other anyway. Better to break early than late. – user202729 Jan 24 '22 at 07:32

I guess what you're looking for is pyquery:

pyquery: a jquery-like library for python.

An example of what you want may be like:

from pyquery import PyQuery

html = "..."  # your HTML code
pq = PyQuery(html)
tag = pq('div#id')  # or tag = pq('div.class')
print(tag.text())

And it uses the same selectors as Firefox's or Chrome's "Inspect Element" feature. For example, if the inspected element's selector is 'div#mw-head.noprint', then in pyquery you just need to pass that selector:

pq('div#mw-head.noprint')

Here you can read more about different HTML parsers in Python and their performance. Even though the article is a bit dated it still gives you a good overview.

Python HTML parser performance

I'd recommend BeautifulSoup even though it isn't built in, simply because it's so easy to work with for these kinds of tasks. E.g.:

from urllib.request import urlopen  # urllib2.urlopen on Python 2
from bs4 import BeautifulSoup       # `from BeautifulSoup import BeautifulSoup` for the old BS3

page = urlopen('http://www.google.com/')
soup = BeautifulSoup(page, 'html.parser')

x = soup.body.find('div', attrs={'class': 'container'}).text
  • I was looking for something that details features/functionality rather than performance/efficiency. EDIT: Sorry for the premature answer, that link is actually good. Thanks. – ffledgling Jul 29 '12 at 12:10
  • The first point-list kind of summarizes the features and functions :) – Qiau Jul 29 '12 at 12:12
  • If you use BeautifulSoup 4 (the latest version): `from bs4 import BeautifulSoup` – Franck Dernoncourt May 22 '14 at 03:04
  • The parser perf article has moved (its from 2008 though so might be out of date) to: https://ianbicking.org/blog/2008/03/python-html-parser-performance.html – kristianp Dec 13 '22 at 05:48

Compared to the other parser libraries, lxml is extremely fast.

And with cssselect it’s quite easy to use for scraping HTML pages too:

from lxml.html import parse

doc = parse('http://www.google.com').getroot()
for link in doc.cssselect('a'):
    print('%s: %s' % (link.text_content(), link.get('href')))

lxml.html Documentation

  • HTTPS not supported – Sergio May 25 '19 at 23:22
  • @Sergio use `import requests`, save buffer to file: https://stackoverflow.com/a/14114741/1518921 (or urllib), after load saved file using parse, `doc = parse('localfile.html').getroot()` – Protomen May 28 '19 at 12:30
  • I parsed huge HTML files for specific data. Doing it with **BeautifulSoup** took `1.7` sec, but applying **lxml** instead made it nearly `100` times faster! If you care about performance, **lxml** is the best option. – Alex-Bogdanov May 29 '20 at 15:52
  • On the other hand, lxml carries a 12MB C extension. Mostly insignificant, but might be depends on what you do (in rare cases). – user202729 Jan 24 '22 at 07:27

I recommend lxml for parsing HTML. See "Parsing HTML" (on the lxml site).

In my experience Beautiful Soup messes up on some complex HTML. I believe that is because Beautiful Soup is not a parser, rather a very good string analyzer.
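For completeness, a minimal sketch of the lxml approach on the question's document (assuming lxml is installed); it supports both XPath, as below, and CSS selectors via cssselect:

```python
from lxml.html import fromstring

html = """<html><body attr1='val1'>
<div class='container'>
  <div id='class'>Something here</div>
  <div>Something else</div>
</div></body></html>"""

root = fromstring(html)
# XPath: the div with class 'container' directly under body
container = root.xpath("//body/div[@class='container']")[0]
print(container.text_content())
```

text_content() concatenates all descendant text, whitespace included; use individual child lookups if you need the pieces separately.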

  • AIUI Beautiful Soup can be made to work with most "backend" XML parsers; lxml seems to be one of the supported parsers: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser – ffledgling Oct 25 '14 at 20:49
  • @ffledgling Some functions of BeautifulSoup are quite sluggish however. – Lenar Hoyt Nov 08 '14 at 01:22

I recommend using the jusText library:

https://github.com/miso-belica/jusText

Usage (Python 2):

import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print paragraph.text

Python 3:

import requests
import justext

response = requests.get("http://bbc.com/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print (paragraph.text)

I would use EHP

https://github.com/iogf/ehp

Here it is:

from ehp import *

doc = '''<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>
'''

html = Html()
dom = html.feed(doc)
for ind in dom.find('div', ('class', 'container')):
    print(ind.text())

Output:

Something here
Something else