
I'm looking for an HTML Parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects.

If I have a document of the form:

<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>

then it should give me a way to access the nested tags via the name or id of the HTML tag so that I can basically ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

If you've used Firefox's "Inspect element" feature (view HTML) you would know that it gives you all the tags in a nice nested manner like a tree.

I'd prefer a built-in module but that might be asking a little too much.
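For context, the closest built-in option, html.parser, is event-driven rather than tree-based, which is why most answers below reach for third-party libraries. A minimal sketch of the event-driven style (stdlib only):

```python
from html.parser import HTMLParser

class DivTextCollector(HTMLParser):
    """Collects text found inside <div> tags, illustrating the callback style."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div>s we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'div':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

parser = DivTextCollector()
parser.feed("<body><div class='container'>"
            "<div id='class'>Something here</div>"
            "<div>Something else</div></div></body>")
print(parser.chunks)  # ['Something here', 'Something else']
```

You have to track nesting yourself, which is exactly the bookkeeping the tree-based libraries do for you.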


I went through a lot of questions on Stack Overflow and a few blogs on the internet, and most of them suggest BeautifulSoup, lxml, or HTMLParser, but few of these detail the functionality; they simply end as a debate over which one is faster/more efficient.

  • Like all the other answerers, I would recommend BeautifulSoup because it is really good at handling broken HTML files. – Pascal Rosin Jul 29 '12 at 12:24

7 Answers


So that I can ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

try:
    from bs4 import BeautifulSoup
except ImportError:
    from BeautifulSoup import BeautifulSoup  # fall back to the legacy BeautifulSoup 3

html = "..."  # the HTML code you've written above
parsed_html = BeautifulSoup(html, 'html.parser')
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)

You probably don't need performance comparisons; just read how BeautifulSoup works and look at its official documentation.
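As a pointer, BeautifulSoup 4 also supports CSS selectors via select()/select_one(), which can be more concise than find() for nested lookups. A minimal sketch against the question's document (assuming bs4 is installed):

```python
from bs4 import BeautifulSoup

html = """<html><body attr1='val1'>
<div class='container'>
  <div id='class'>Something here</div>
  <div>Something else</div>
</div></body></html>"""

soup = BeautifulSoup(html, 'html.parser')
# CSS-selector equivalent of find('div', attrs={'class': 'container'})
container = soup.select_one('body div.container')
print(container.get_text(separator=' ', strip=True))
```

select_one() returns the first match (or None), while select() returns a list of all matches.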

  • What exactly is the parsed_html object? – ffledgling Jul 29 '12 at 12:21
  • parsed_html is a BeautifulSoup object; think of it like a DOMElement or DOMDocument, except it has "tricky" properties: e.g. "body" refers to the BeautifulSoup object (remember, it's basically a tree node) of the first (and, in this case, only) body element of the root element (in our case, html). – Aadaam Jul 29 '12 at 12:38
  • General info: if performance is critical, better use the `lxml` library instead (see answer below). With `cssselect` it's quite useful as well, and performance is often 10- to 100-fold better than the other libraries available. – Lenar Hoyt Nov 08 '14 at 01:04
  • note: `class` attribute is special: `BeautifulSoup(html).find('div', 'container').text` – jfs Mar 10 '16 at 17:17
  • @mcb: if `lxml` is installed; `BeautifulSoup` can use it to parse html. – jfs Mar 10 '16 at 17:18
  • @J.F.Sebastian I know; as far as I remember the BeautifulSoup front end makes it slow, but I might be wrong. Did you compare it? – Lenar Hoyt Mar 10 '16 at 17:22
  • @mcb I don't remember having performance issues with it. YMMV. You could try to pass `parse_only=SoupStrainer(*interesting_parts)` and see if it helps. – jfs Mar 10 '16 at 17:30
  • `parsed_html = BeautifulSoup(html)` doesn't work for me; `parsed_html = BeautifulSoup(html, 'html.parser')` does. – Pavel Mar 14 '17 at 12:11
  • @BaileyParker you'd think in a sea of people constantly dealing with python2, python3, c++11, c++17 Opencv 4.3, and Java 2021, someone would have ****ing thought of naming it `import bs` so when they change their minds with bs5 they don't break everyone's code yet again – Nathan majicvr.com Jun 10 '20 at 10:41
  • Remark: beautifulsoup supports css selector too, see https://stackoverflow.com/questions/24801548/how-to-use-css-selectors-to-retrieve-specific-links-lying-in-some-class-using-be – so it's not "less user friendly" ⟺ also "jquery-like". – user202729 Jan 24 '22 at 07:28
  • @Nathan To be fair, a major version update means a major incompatible change, so it's likely that the code would break one way or the other anyway. Better to break early than late. – user202729 Jan 24 '22 at 07:32

I guess what you're looking for is pyquery:

pyquery: a jquery-like library for python.

An example of what you want may be like:

from pyquery import PyQuery

html = "..."  # your HTML code
pq = PyQuery(html)
tag = pq('div#id')  # or tag = pq('div.class')
print(tag.text())

And it uses the same selectors as Firefox's or Chrome's "Inspect Element" feature. For example, if the inspected element's selector is 'div#mw-head.noprint', then in pyquery you just need to pass that selector:

pq('div#mw-head.noprint')

Here you can read more about different HTML parsers in Python and their performance. Even though the article is a bit dated it still gives you a good overview.

Python HTML parser performance

I'd recommend BeautifulSoup even though it isn't built in, simply because it's so easy to work with for these kinds of tasks. E.g.:

from urllib.request import urlopen  # urllib2.urlopen on Python 2
from bs4 import BeautifulSoup       # `from BeautifulSoup import BeautifulSoup` for the old BS3

page = urlopen('http://www.google.com/')
soup = BeautifulSoup(page, 'html.parser')

x = soup.body.find('div', attrs={'class': 'container'}).text
  • I was looking for something that details features/functionality rather than performance/efficiency. EDIT: Sorry for the premature answer, that link is actually good. Thanks. – ffledgling Jul 29 '12 at 12:10
  • The first point-list kind of summarizes the features and functions :) – Qiau Jul 29 '12 at 12:12
  • If you use BeautifulSoup 4 (the latest version): `from bs4 import BeautifulSoup` – Franck Dernoncourt May 22 '14 at 03:04
  • The parser perf article has moved (its from 2008 though so might be out of date) to: https://ianbicking.org/blog/2008/03/python-html-parser-performance.html – kristianp Dec 13 '22 at 05:48

Compared to the other parser libraries, lxml is extremely fast.

And with cssselect it’s quite easy to use for scraping HTML pages too:

from lxml.html import parse

doc = parse('http://www.google.com').getroot()
for link in doc.cssselect('a'):
    print('%s: %s' % (link.text_content(), link.get('href')))

lxml.html Documentation

  • HTTPS not supported – Sergio May 25 '19 at 23:22
  • @Sergio use `import requests`, save buffer to file: https://stackoverflow.com/a/14114741/1518921 (or urllib), after load saved file using parse, `doc = parse('localfile.html').getroot()` – Protomen May 28 '19 at 12:30
  • I parsed huge HTML files for specific data. Doing it with **BeautifulSoup** took `1.7` sec, but applying **lxml** instead made it nearly `100` times faster! If you care about performance, **lxml** is the best option. – Alex-Bogdanov May 29 '20 at 15:52
  • On the other hand, lxml carries a 12MB C extension. Mostly insignificant, but might be depends on what you do (in rare cases). – user202729 Jan 24 '22 at 07:27

I recommend lxml for parsing HTML. See "Parsing HTML" (on the lxml site).

In my experience Beautiful Soup messes up on some complex HTML. I believe that is because Beautiful Soup is not a parser, rather a very good string analyzer.
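For completeness, a minimal sketch of the lxml approach on the question's document (assuming lxml is installed); it supports both XPath, as below, and CSS selectors via cssselect:

```python
from lxml.html import fromstring

html = """<html><body attr1='val1'>
<div class='container'>
  <div id='class'>Something here</div>
  <div>Something else</div>
</div></body></html>"""

root = fromstring(html)
# XPath: the div with class 'container' directly under body
container = root.xpath("//body/div[@class='container']")[0]
print(container.text_content())
```

text_content() concatenates all descendant text, whitespace included; use individual child lookups if you need the pieces separately.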

  • AIUI Beautiful Soup can be made to work with most "backend" XML parsers; lxml seems to be one of the supported parsers: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser – ffledgling Oct 25 '14 at 20:49
  • @ffledgling Some functions of BeautifulSoup are quite sluggish however. – Lenar Hoyt Nov 08 '14 at 01:22

I recommend using the jusText library:

https://github.com/miso-belica/jusText

Usage (Python 2):

import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print paragraph.text

Python 3:

import requests
import justext

response = requests.get("http://bbc.com/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print (paragraph.text)

I would use EHP

https://github.com/iogf/ehp

Here it is:

from ehp import *

doc = '''<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>
'''

html = Html()
dom = html.feed(doc)
for ind in dom.find('div', ('class', 'container')):
    print(ind.text())

Output:

Something here
Something else