
I'm looking to write a Python script (using 3.4.3) that grabs an HTML page from a URL and can go through the DOM to try to find a specific element.

I currently have this:

#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)

When I print content it does print out the entire HTML page, which is close to what I want... although I would ideally like to be able to navigate through the DOM rather than treating it as a giant string.

I'm still fairly new to Python but have experience with multiple other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar with Java before but wanted to try it out in Python.

Jake Alsemgeest

2 Answers


There are many different modules you could use. For example, lxml or BeautifulSoup.

Here's an lxml example:

import urllib.request
import lxml.html

mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)

description = lxml_mysite.xpath("//meta[@name='description']")[0]  # the <meta name="description"> element
text = description.get('content')  # its content attribute

>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

And a BeautifulSoup example:

import urllib.request
from bs4 import BeautifulSoup

mysite = urllib.request.urlopen('http://www.google.com').read()
soup_mysite = BeautifulSoup(mysite, "html.parser")  # naming a parser avoids a warning

description = soup_mysite.find("meta", {"name": "description"})  # the <meta name="description"> element
text = description['content']  # its content attribute

>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

Note: under Python 2, BeautifulSoup decodes results to unicode strings while lxml can hand back byte strings, which can be useful or hurtful depending on what is needed. Under Python 3 (which you're using), both return str, so the difference disappears.
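Once you have a soup object, you can navigate the tree rather than string-matching. A minimal sketch (using an inline HTML snippet so it runs without a network request; the tag names and ids are made up for illustration):

```python
from bs4 import BeautifulSoup

# A small inline document instead of a fetched page.
html = '<div id="main"><p class="intro">Hello</p><p>World</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Find one element by tag and attribute.
intro = soup.find("p", {"class": "intro"})
print(intro.get_text())  # Hello

# Or collect every matching descendant of a specific element.
texts = [p.get_text() for p in soup.find("div", id="main").find_all("p")]
print(texts)  # ['Hello', 'World']
```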

Zach Gates
  • It seems that trying to use BeautifulSoup gives me an error as I'm using Python 3.4.3. – Jake Alsemgeest Mar 12 '15 at 23:05
  • `File "find.py", line 3, in from bs4 import BeautifulSoup File "C:\Users\Jake\Desktop\bs4\__init__.py", line 175 except Exception, e: ^ SyntaxError: invalid syntax` I looked it up and it seems to be something to do with the fact that it's a 2.x library? – Jake Alsemgeest Mar 13 '15 at 02:59
  • Can someone please tell me why people suggest BeautifulSoup or lxml over the native html parser? – Shatu Aug 25 '17 at 20:34
  • @Shatu: Modules like `BeautifulSoup` and `lxml` are better in performance, generally speaking. – Zach Gates Aug 25 '17 at 20:45
  • @Zach: Performance in terms of their ability to parse not-so-well-formed html, or in terms of time taken to parse? – Shatu Aug 25 '17 at 20:47
  • @Shatu: Speed, memory usage, etc. I'm unsure how either performs with malformed data. – Zach Gates Aug 25 '17 at 20:53
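For comparison with the comments above, the standard library's native parser (`html.parser`) can do the same meta-tag lookup, though you have to drive it yourself via callbacks rather than querying a tree. A minimal sketch, using an inline snippet rather than a fetched page:

```python
from html.parser import HTMLParser

# Subclass that records the content of <meta name="description"> as it parses.
class DescriptionParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")

parser = DescriptionParser()
parser.feed('<html><head><meta name="description" content="A demo page."></head></html>')
print(parser.description)  # A demo page.
```

This avoids any third-party install, at the cost of writing event-handler code instead of XPath or `find` queries.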

Check out the BeautifulSoup module.

import urllib.request
from bs4 import BeautifulSoup

# urllib.request.urlopen is the Python 3 spelling (urllib.urlopen is Python 2).
soup = BeautifulSoup(urllib.request.urlopen("http://google.com").read(), "html.parser")

# find_all('a') yields every anchor tag in the document.
for link in soup.find_all('a'):
    print(link.get('href'))
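One thing to watch for: many of the hrefs printed this way are relative. A small stdlib sketch (the URLs here are made-up examples) shows resolving them against the page's URL with `urllib.parse.urljoin` before following them:

```python
from urllib.parse import urljoin

base = "http://google.com/search"

# urljoin resolves relative hrefs against the page URL;
# absolute hrefs pass through unchanged.
resolved = [urljoin(base, href) for href in ["/about", "mail", "http://example.com/x"]]
for url in resolved:
    print(url)
# http://google.com/about
# http://google.com/mail
# http://example.com/x
```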
Boa
  • 2
    Hiya, this may well solve the problem... but it'd be good if you could edit your answer and provide a little explanation about how and why it works :) Don't forget - there are heaps of newbies on Stack overflow, and they could learn a thing or two from your expertise - what's obvious to you might not be so to them. – Taryn East Mar 12 '15 at 04:14