
I'm using the urllib.urlopen() method and BeautifulSoup for crawling. I'm not satisfied with the crawling speed, and I'm wondering what urllib actually fetches, since I'm guessing it must load more than the HTML alone. I couldn't find in the docs whether it reads or checks bigger data (images, Flash, ...) by default.

So, if urllib does have to load e.g. images, Flash, JS..., how can I avoid GET requests for such data types?
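Roughly what my crawler does at the moment (a simplified sketch; the URL is a placeholder and I'm assuming BeautifulSoup 4 here):

import urllib
from bs4 import BeautifulSoup

html = urllib.urlopen('http://example.com/some-page.html').read()
soup = BeautifulSoup(html)

# collect links to crawl next
links = [a['href'] for a in soup.findAll('a', href=True)]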

Kara
Alex
  • Are you trying to load multiple sites at the same time? – Floris Apr 01 '14 at 13:47
  • Take a look at the question [here](http://stackoverflow.com/questions/4119680/multiple-asynchronous-connections-with-urllib2-or-other-http-library) - maybe you can use those techniques to have more requests at the same time. Can make a big difference (with enough bandwidth, most of the delay is "waiting"). – Floris Apr 01 '14 at 13:50
  • You might check out Scrapy for web crawling in Python. http://scrapy.org/ It will process web pages in parallel by default. – bgschiller Apr 01 '14 at 13:57

2 Answers


Try requests - it implements HTTP connection pooling, which speeds up crawling.

It also takes care of other things like cookies, auth, etc. much better than urllib, and it works great with BeautifulSoup.
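For example, a rough sketch (the URLs are placeholders and I'm assuming BeautifulSoup 4 here) - a requests.Session keeps connections alive and reuses them for requests to the same host:

import requests
from bs4 import BeautifulSoup

# a Session pools and reuses TCP connections across requests
session = requests.Session()

for url in ['http://example.com/page1', 'http://example.com/page2']:
    response = session.get(url)
    soup = BeautifulSoup(response.text)
    print soup.title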

GabiMe

Use threading! It's super simple. Here's an example. You can change the number of connections to suit your needs.

# Python 2 standard-library modules
import threading, Queue
import urllib

urls = [
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',    
    ]

# fill a queue with (url, local filename) work items
queue = Queue.Queue()
for x, url in enumerate(urls):
    filename = "datafile%s-%s" % (x, url)
    queue.put((url, filename))


num_connections = 10  # number of worker threads fetching in parallel

class WorkerThread(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while 1:
            try:
                url, filename = self.queue.get_nowait()
            except Queue.Empty:
                # queue is drained - let this worker exit
                raise SystemExit

            # download the page to a local file (drop the scheme from the filename)
            urllib.urlretrieve(url, filename.replace('http://', ''))

# start threads
threads = []
for dummy in range(num_connections):
    t = WorkerThread(queue)
    t.start()
    threads.append(t)


# Wait for all threads to finish
for thread in threads:
    thread.join()
Genome
  • As I can see, that is a solution with several threads. I wonder how to eliminate non-HTML content. – Alex Apr 01 '14 at 15:58
  • You could use a "blacklist" to skip over the URLs that have content you don't need. For instance... blacklist = ['.jpeg','.jpg','.gif'] (see the sketch below). – Genome Apr 02 '14 at 02:12
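A minimal sketch of that blacklist idea (the extension list is just an example) - filter the URLs before they go into the queue:

blacklist = ['.jpeg', '.jpg', '.gif', '.png', '.swf', '.js', '.css']

def looks_like_html(url):
    # skip URLs ending in an extension we don't want to fetch
    return not any(url.lower().endswith(ext) for ext in blacklist)

urls = [u for u in urls if looks_like_html(u)]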