
I'd like to write a crawler using Python. This means: I've got the URLs of some websites' home pages, and I'd like my program to crawl each site by following links that stay within that site. How can I do this easily and FAST? I tried BeautifulSoup already, but it is really CPU-consuming and quite slow on my PC.

Matteo Monti

4 Answers


I'd recommend using mechanize in combination with lxml.html. As robert king suggested, mechanize is probably best for navigating through the site. For extracting elements I'd use lxml, which is much faster than BeautifulSoup and probably the fastest HTML parser available for Python; this link shows a performance test of different HTML parsers for Python. Personally, I'd refrain from using the scrapy wrapper.

I haven't tested it, but this is probably what you're looking for; the first part is taken straight from the mechanize documentation. The lxml documentation is also quite helpful. In particular, take a look at this and this section.

import mechanize
import lxml.html

br = mechanize.Browser()
response = br.open("somewebsite")

for link in br.links():
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    print br
    br.back()

# you can also display the links with lxml
html = response.read()
root = lxml.html.fromstring(html)
for link in root.iterlinks():
    print link

You can also get elements via root.xpath(). A simple wget might even be the easiest solution.
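For example, building on the root object from the snippet above (untested sketch; the base URL is the same placeholder as before), an XPath query pulls out every href in one go:

# grab every href attribute on the page with a single XPath query
hrefs = root.xpath('//a/@href')

# turn relative links into absolute ones before following them
root.make_links_absolute("somewebsite")
hrefs = root.xpath('//a/@href')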

Hope this helps.

ilprincipe
  • I don't think the first `for` loop is doing what you intend. – cerberos Jul 11 '11 at 12:53
  • I have only used mechanize for navigating through forms so far. Isn't it simply printing each link on the site and then following it? br.back might be unnecessary, but it seems to me like it's doing exactly that. Posts [here](http://stackoverflow.com/questions/3569622/python-mechanize-following-link-by-url-and-what-is-the-nr-parameter) are similar. But since I am pretty new to python/programming in general, please correct me if I am wrong. – ilprincipe Jul 11 '11 at 13:50
  • Anyway, I started using lxml and everything got faster! Thank you! – Matteo Monti Jul 11 '11 at 17:53

I like using mechanize. It's fairly simple: you download it and create a browser object, and with this object you can open a URL. You can use "back" and "forward" functions as in a normal browser. You can iterate through the forms on the page and fill them out if need be, and you can iterate through all the links on the page too. Each link object has the url etc., which you can click on.

Here is an example: Download all the links (related documents) on a webpage using Python
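An untested sketch of that workflow (the URL is a placeholder, and the form field name "q" is just an example; the real field names depend on the page):

import mechanize

br = mechanize.Browser()
br.open("http://www.example.com/")   # placeholder start page

# every Link object carries the url, the anchor text, etc.
for link in br.links():
    print link.url, link.text

# forms can be listed, filled out and submitted in much the same way
for form in br.forms():
    print form
br.select_form(nr=0)          # pick the first form on the page
br["q"] = "something"         # "q" is just an example field name
response = br.submit()

br.back()                     # back/forward work as in a normal browser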

Rusty Rob

Here's an example of a very fast (concurrent) recursive web scraper using eventlet. It only prints the urls it finds but you can modify it to do what you want. Perhaps you'd want to parse the html with lxml (fast), pyquery (slower but still fast) or BeautifulSoup (slow) to get the data you want.
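A rough, untested sketch of what such an eventlet-based crawler could look like (the start URL is a placeholder, and the regex-based link extraction is deliberately crude; swap in lxml for real use):

import re
import urlparse
import eventlet
from eventlet.green import urllib2   # cooperative (non-blocking) urllib2

start_url = "http://www.example.com/"              # placeholder
allowed_host = urlparse.urlparse(start_url).netloc

pool = eventlet.GreenPool(20)   # up to 20 concurrent fetches
seen = set()

def crawl(url):
    print url
    try:
        html = urllib2.urlopen(url).read()
    except Exception:
        return
    for href in re.findall(r'href="(.*?)"', html):
        link = urlparse.urljoin(url, href)
        # stay inside the site and never visit a page twice
        if urlparse.urlparse(link).netloc == allowed_host and link not in seen:
            seen.add(link)
            pool.spawn_n(crawl, link)

seen.add(start_url)
pool.spawn_n(crawl, start_url)
pool.waitall()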

cerberos

Have a look at scrapy (and related questions). As for performance... very difficult to make any useful suggestions without seeing the code.
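For reference, a minimal sketch of a Scrapy CrawlSpider that stays within one site (untested; the spider name, domain and URL are placeholders, and the import paths are those of recent Scrapy releases):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    name = "site"
    allowed_domains = ["example.com"]          # keeps the crawl on this site
    start_urls = ["http://www.example.com/"]

    rules = (
        # follow every in-site link and hand each fetched page to parse_page
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        yield {"url": response.url}

Run it with something like scrapy crawl site -o pages.json; Scrapy then handles the scheduling, deduplication and concurrency for you.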

Rob Cowie