
I have a reasonably long list of websites whose landing pages (index.html or equivalent) I want to download. I am currently using Scrapy (much love to the guys behind it -- this is a fabulous framework). Scrapy is slower on this particular task than I'd like, and I am wondering if wget or another alternative would be faster given how straightforward the task is. Any ideas?

(Here's what I am doing with Scrapy. Is there anything I can do to optimize Scrapy for this task?)

So, I have a start URLs list like

start_urls = ['google.com', 'yahoo.com', 'aol.com']

And I scrape the text from each response and store it in an XML file. I need to turn off the OffsiteMiddleware to allow for multiple domains.
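Concretely, the spider is along these lines (a minimal sketch against the old BaseSpider API of that era; class, item, and field names are placeholders):

    # Minimal sketch of the spider described above -- names are placeholders,
    # written against the old scrapy.spider.BaseSpider API.
    from scrapy.item import Item, Field
    from scrapy.spider import BaseSpider

    class PageItem(Item):
        url = Field()
        body = Field()

    class LandingPageSpider(BaseSpider):
        name = 'landing_pages'
        start_urls = ['http://google.com', 'http://yahoo.com', 'http://aol.com']

        def parse(self, response):
            # One item per landing page; an exporter/pipeline writes these out as XML.
            item = PageItem()
            item['url'] = response.url
            item['body'] = response.body
            return item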

Scrapy works as expected, but seems slow (about 1000 pages an hour, or roughly one every 4 seconds). Is there a way to speed this up by increasing CONCURRENT_REQUESTS_PER_SPIDER while running a single spider? Anything else?
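For reference, the relevant knobs live in the project's settings.py; something like this is what I have in mind (values are illustrative, and as a comment below notes, later Scrapy versions rename the concurrency setting):

    # settings.py -- illustrative values only
    CONCURRENT_REQUESTS_PER_SPIDER = 32   # the pre-0.14 setting asked about here
    # CONCURRENT_REQUESTS = 32            # equivalent name in Scrapy 0.14+
    DOWNLOAD_TIMEOUT = 15                 # assumption: give up on very slow hosts sooner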

Cygorger
  • Nowadays there are excellent Python libs you might want to use - [urllib3](http://urllib3.readthedocs.org/) (uses thread pools) and [requests](http://docs.python-requests.org/) (uses thread pools through urllib3 or non-blocking IO through [gevent](http://www.gevent.org/)); a quick sketch of that approach follows these comments – Piotr Dobrogost Jan 26 '12 at 11:19
  • Instead of programming, you could push a plaintext list of index.html pages into [HTTrack](http://www.httrack.com/page/9/en/index.html), and set the crawler depth to 0 links. Note that this software will only create a copy of the index pages on your local machine, viewable offline. – yurisich Jan 26 '12 at 11:37
  • In Scrapy 0.14+ you will want to adjust `CONCURRENT_REQUESTS` instead of the old `CONCURRENT_REQUESTS_PER_SPIDER` setting. – Pablo Hoffman Feb 19 '12 at 07:12
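
To make the thread-pool suggestion from the first comment concrete, here is a minimal sketch using requests together with a multiprocessing.dummy thread pool (the pairing, pool size, and timeout are my own choices, not the commenter's recipe):

    # Sketch of a thread-pooled fetch with requests; pool size and timeout are illustrative.
    import requests
    from multiprocessing.dummy import Pool  # a thread pool behind the multiprocessing API

    sites = ['google.com', 'yahoo.com', 'aol.com']

    def fetch(site):
        # requests follows redirects by default, so this lands on each site's index page
        return site, requests.get('http://' + site, timeout=15).text

    pool = Pool(8)                          # 8 worker threads
    pages = dict(pool.map(fetch, sites))    # {site: html}
    pool.close()
    pool.join()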

1 Answer


If you want to download multiple sites concurrently with Python, you can do so with the standard library like this:

import threading
import urllib

maxthreads = 4

sites = ['google.com', 'yahoo.com', ]  # etc.

class Download(threading.Thread):
    def run(self):
        # Each thread pops sites off the shared list until it is empty.
        while sites:
            site = sites.pop()
            print "start", site
            # Save each landing page to a local file named after the site.
            urllib.urlretrieve('http://' + site, site)
            print "end  ", site

# Start no more threads than there are sites to fetch.
for x in xrange(min(maxthreads, len(sites))):
    Download().start()

You could also check out httplib2 or PycURL to do the downloading for you instead of urllib.
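
With httplib2, for instance, the fetch itself is only a couple of lines (a sketch; the cache directory and error handling are up to you):

    # Sketch of a single fetch with httplib2; '.cache' is just an example cache directory.
    import httplib2

    h = httplib2.Http('.cache')
    response, content = h.request('http://google.com')  # returns (headers, body)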

I'm not clear exactly how you want the scraped text to look as XML, but you could use xml.etree.ElementTree from the standard library, or you could install BeautifulSoup (which would be better, as it handles malformed markup).
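
For example, something along these lines extracts the visible text and writes one element per site (a sketch: the element names, file name, and the BeautifulSoup 4 import are my own choices, and `pages` stands for a site-to-HTML mapping like the one built in the earlier sketch):

    # Sketch: strip the text out of each downloaded page and write a simple XML file.
    import xml.etree.ElementTree as ET
    from bs4 import BeautifulSoup   # BeautifulSoup 4; older versions import differently

    root = ET.Element('pages')
    for site, html in pages.items():            # 'pages' maps site -> downloaded HTML
        text = BeautifulSoup(html, 'html.parser').get_text()
        page = ET.SubElement(root, 'page', url=site)
        page.text = text

    ET.ElementTree(root).write('pages.xml', encoding='utf-8')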

Ian Mackinnon
  • Do you think `.pop()` is thread safe? See [Are Python built-in container thread-safe?](http://stackoverflow.com/q/2227169/95735) – Piotr Dobrogost Jan 26 '12 at 10:44