
I'm looking to write a Python script (using 3.4.3) that grabs an HTML page from a URL and can go through the DOM to try to find a specific element.

I currently have this:

#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)

When I print content it does print out the entire HTML page, which is close to what I want... although I would ideally like to be able to navigate through the DOM rather than treating it as a giant string.

I'm still fairly new to Python but have experience with multiple other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar with Java before but wanted to try it out in Python.

Jake Alsemgeest

2 Answers


There are many different modules you could use. For example, lxml or BeautifulSoup.

Here's an lxml example:

import urllib.request
import lxml.html

mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)

description = lxml_mysite.xpath("//meta[@name='description']")[0]  # the <meta name="description"> element
text = description.get('content')  # its content attribute

>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

And a BeautifulSoup example:

import urllib.request
from bs4 import BeautifulSoup

mysite = urllib.request.urlopen('http://www.google.com').read()
soup_mysite = BeautifulSoup(mysite, "html.parser")  # naming a parser avoids a warning

description = soup_mysite.find("meta", {"name": "description"})  # the <meta name="description"> element
text = description['content']  # its content attribute

>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

Note: under Python 2, BeautifulSoup decodes results to unicode strings while lxml can hand back byte strings, which can be useful or hurtful depending on what is needed. Under Python 3 (which you're using), both return str, so the difference disappears.
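Once you have a soup object, you can navigate the tree rather than string-matching. A minimal sketch (using an inline HTML snippet so it runs without a network request; the tag names and ids are made up for illustration):

```python
from bs4 import BeautifulSoup

# A small inline document instead of a fetched page.
html = '<div id="main"><p class="intro">Hello</p><p>World</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Find one element by tag and attribute.
intro = soup.find("p", {"class": "intro"})
print(intro.get_text())  # Hello

# Or collect every matching descendant of a specific element.
texts = [p.get_text() for p in soup.find("div", id="main").find_all("p")]
print(texts)  # ['Hello', 'World']
```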

Zach Gates
  • It seems that trying to use BeautifulSoup gives me an error as I'm using Python 3.4.3. – Jake Alsemgeest Mar 12 '15 at 23:05
  • `File "find.py", line 3, in from bs4 import BeautifulSoup File "C:\Users\Jake\Desktop\bs4\__init__.py", line 175 except Exception, e: ^ SyntaxError: invalid syntax` I looked it up and it seems to be something to do with the fact that it's a 2.x library? – Jake Alsemgeest Mar 13 '15 at 02:59
  • Can someone please tell me why people suggest BeautifulSoup or lxml over the native html parser? – Shatu Aug 25 '17 at 20:34
  • @Shatu: Modules like `BeautifulSoup` and `lxml` are better in performance, generally speaking. – Zach Gates Aug 25 '17 at 20:45
  • @Zach: Performance in terms of their ability to parse not-so-well-formed html, or in terms of time taken to parse? – Shatu Aug 25 '17 at 20:47
  • @Shatu: Speed, memory usage, etc. I'm unsure how either performs with malformed data. – Zach Gates Aug 25 '17 at 20:53
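For comparison with the comments above, the standard library's native parser (`html.parser`) can do the same meta-tag lookup, though you have to drive it yourself via callbacks rather than querying a tree. A minimal sketch, using an inline snippet rather than a fetched page:

```python
from html.parser import HTMLParser

# Subclass that records the content of <meta name="description"> as it parses.
class DescriptionParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")

parser = DescriptionParser()
parser.feed('<html><head><meta name="description" content="A demo page."></head></html>')
print(parser.description)  # A demo page.
```

This avoids any third-party install, at the cost of writing event-handler code instead of XPath or `find` queries.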

Check out the BeautifulSoup module.

import urllib.request
from bs4 import BeautifulSoup

# urllib.request.urlopen is the Python 3 spelling (urllib.urlopen is Python 2).
soup = BeautifulSoup(urllib.request.urlopen("http://google.com").read(), "html.parser")

# find_all('a') yields every anchor tag in the document.
for link in soup.find_all('a'):
    print(link.get('href'))
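One thing to watch for: many of the hrefs printed this way are relative. A small stdlib sketch (the URLs here are made-up examples) shows resolving them against the page's URL with `urllib.parse.urljoin` before following them:

```python
from urllib.parse import urljoin

base = "http://google.com/search"

# urljoin resolves relative hrefs against the page URL;
# absolute hrefs pass through unchanged.
resolved = [urljoin(base, href) for href in ["/about", "mail", "http://example.com/x"]]
for url in resolved:
    print(url)
# http://google.com/about
# http://google.com/mail
# http://example.com/x
```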
Boa
  • 2
    Hiya, this may well solve the problem... but it'd be good if you could edit your answer and provide a little explanation about how and why it works :) Don't forget - there are heaps of newbies on Stack overflow, and they could learn a thing or two from your expertise - what's obvious to you might not be so to them. – Taryn East Mar 12 '15 at 04:14