
I'm using the urllib.urlopen() method and BeautifulSoup for crawling. I'm not satisfied with the crawling speed, and I'm wondering what urllib actually fetches, since I'm guessing it must load more than the HTML alone. I couldn't find in the docs whether it reads or checks bigger data (images, Flash, ...) by default.

So, if urllib does have to load e.g. images, Flash, JS..., how can I avoid GET requests for such data types?
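Roughly what my crawler does at the moment (a simplified sketch; the URL is a placeholder and I'm assuming BeautifulSoup 4 here):

import urllib
from bs4 import BeautifulSoup

html = urllib.urlopen('http://example.com/some-page.html').read()
soup = BeautifulSoup(html)

# collect links to crawl next
links = [a['href'] for a in soup.findAll('a', href=True)]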

Kara
Alex
  • Are you trying to load multiple sites at the same time? – Floris Apr 01 '14 at 13:47
  • Take a look at the question [here](http://stackoverflow.com/questions/4119680/multiple-asynchronous-connections-with-urllib2-or-other-http-library) - maybe you can use those techniques to have more requests at the same time. Can make a big difference (with enough bandwidth, most of the delay is "waiting"). – Floris Apr 01 '14 at 13:50
  • You might check out Scrapy for web crawling in Python. http://scrapy.org/ It will process web pages in parallel by default. – bgschiller Apr 01 '14 at 13:57

2 Answers


Try requests - it implements HTTP connection pooling, which speeds up crawling.

It also takes care of other things like cookies, auth, etc. much better than urllib, and it works great with BeautifulSoup.
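For example, a rough sketch (the URLs are placeholders and I'm assuming BeautifulSoup 4 here) - a requests.Session keeps connections alive and reuses them for requests to the same host:

import requests
from bs4 import BeautifulSoup

# a Session pools and reuses TCP connections across requests
session = requests.Session()

for url in ['http://example.com/page1', 'http://example.com/page2']:
    response = session.get(url)
    soup = BeautifulSoup(response.text)
    print soup.title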

GabiMe

Use threading! It's super simple. Here's an example. You can change the number of connections to suit your needs.

# Python 2 standard-library modules
import threading, Queue
import urllib

urls = [
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',    
    ]

# fill a queue with (url, local filename) work items
queue = Queue.Queue()
for x, url in enumerate(urls):
    filename = "datafile%s-%s" % (x, url)
    queue.put((url, filename))


num_connections = 10  # number of worker threads fetching in parallel

class WorkerThread(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while 1:
            try:
                url, filename = self.queue.get_nowait()
            except Queue.Empty:
                # queue is drained - let this worker exit
                raise SystemExit

            # download the page to a local file (drop the scheme from the filename)
            urllib.urlretrieve(url, filename.replace('http://', ''))

# start threads
threads = []
for dummy in range(num_connections):
    t = WorkerThread(queue)
    t.start()
    threads.append(t)


# Wait for all threads to finish
for thread in threads:
    thread.join()
Genome
  • As I can see, that is a solution with several threads. I wonder how to eliminate non-HTML content. – Alex Apr 01 '14 at 15:58
  • You could use a "blacklist" to skip over the URLs that have content you don't need. For instance... blacklist = ['.jpeg','.jpg','.gif'] (see the sketch below). – Genome Apr 02 '14 at 02:12
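A minimal sketch of that blacklist idea (the extension list is just an example) - filter the URLs before they go into the queue:

blacklist = ['.jpeg', '.jpg', '.gif', '.png', '.swf', '.js', '.css']

def looks_like_html(url):
    # skip URLs ending in an extension we don't want to fetch
    return not any(url.lower().endswith(ext) for ext in blacklist)

urls = [u for u in urls if looks_like_html(u)]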