I am scraping 23770 webpages with a pretty simple web scraper using scrapy. I am quite new to scrapy and even python, but managed to write a spider that does the job. It is, however, really slow (it takes approx. 28 hours to crawl the 23770 pages).

I have looked at the Scrapy website, the mailing lists and Stack Overflow, but I can't seem to find generic, beginner-friendly recommendations for writing fast crawlers. Maybe my problem is not the spider itself, but the way I run it. All suggestions are welcome!

I have listed my code below, if it's needed.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import re

class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()

class HouseSpider(BaseSpider):
    name = 'House'
    allowed_domains = ["boliga.dk"]  # domain names only, not full URLs
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' % n for n in xrange(1, 23771)]  # pages 1..23770

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("id('searchresult')/tr")
        items = []      
        for site in sites:
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            item['Salgsdato'] = site.select("td[3]/text()").extract()
            Temp = site.select("td[4]/text()").extract()
            Temp = Temp[0] if Temp else ''  # guard against an empty cell
            m = re.search('\r\n\t\t\t\t\t(.+?)\r\n\t\t\t\t', Temp)
            if m:
                found = m.group(1)
                item['SalgsType'] = found
            else:
                item['SalgsType'] = Temp
            item['KvmPris'] = site.select("td[5]/text()").extract()
            item['Rum'] = site.select("td[6]/text()").extract()
            item['Postnummer'] = site.select("td[7]/text()").extract()
            item['Boligtype'] = site.select("td[8]/text()").extract()
            item['Kvm'] = site.select("td[9]/text()").extract()
            item['Bygget'] = site.select("td[10]/text()").extract()
            items.append(item)
        return items

Thanks!

Mace
  • The first thing you can do is to use threads (see the relevant info in the standard library docs) to run, say, 5-10 downloads at the same time, which could obviously give a big improvement in execution time. Apart from that, I don't see any easy way to speed things up, as your code seems straightforward. – michaelmeyer Jun 10 '13 at 18:08
  • @doukremt: Thanks! I have looked at the documentation, and it seems pretty simple for what I need it for. Do I understand correctly that for each connection I should call `thread.start_new_thread(parse)`? Or will I just get two connections that each scrape all 23770 pages? – Mace Jun 10 '13 at 18:22
  • Scrapy is actually async, so it already downloads in parallel (you can set how many concurrent requests it makes). – Capi Etheriel Jun 10 '13 at 22:43
  • @barraponto: In words that an imbecile like me will understand: two concurrent requests will together fetch the 23770 pages, rather than each downloading all of them, right? :) – Mace Jun 11 '13 at 19:09
  • @Mace Scrapy is single-threaded, but it downloads in parallel and processes responses while it waits for the network... that's what non-blocking or async means. It could, of course, be multithreaded, but that would add complexity to the code without real advantages, since your bottleneck is the network calls, not the parsing code. – Capi Etheriel Jun 12 '13 at 21:21

4 Answers

Here's a collection of things to try:

  • use the latest Scrapy version (if you aren't already)
  • check whether any non-standard middlewares are in use
  • try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs); see the settings sketch after this list
  • turn off logging with LOG_ENABLED = False (docs)
  • try yielding each item in the loop instead of collecting items into the items list and returning them; see the parse() sketch after this list
  • use a local DNS cache (see this thread)
  • check whether the site uses a download threshold and is limiting your download speed (see this thread)
  • log CPU and memory usage during the spider run - see if there are any problems there
  • try running the same spider under the scrapyd service
  • see if grequests + lxml perform better (ask if you need any help implementing this solution)
  • try running Scrapy on PyPy, see Running Scrapy on PyPy
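
For the concurrency, logging, and yield points above, here is a minimal sketch of what the changes could look like (the values are illustrative only, not tuned recommendations for this particular site):

    # settings.py -- illustrative values, adjust for the target site
    CONCURRENT_REQUESTS = 32              # total requests Scrapy may have in flight
    CONCURRENT_REQUESTS_PER_DOMAIN = 16   # concurrent requests per domain
    LOG_ENABLED = False                   # skip per-request log output

And the same parse() written as a generator, so items are handed to the engine as soon as they are built instead of being buffered in a list (the selectors are unchanged from the question):

    # sketch of parse() as a generator (same extraction logic as in the question)
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select("id('searchresult')/tr"):
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            # ... fill in the remaining fields exactly as before ...
            yield item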

Hope that helps.

alecxe
  • Thanks! Are the points ordered by relevance/expected performance improvement? – Mace Jun 10 '13 at 18:59
  • Well, there's no special order here, but I'd first check for issues unrelated to Scrapy, like a site download limit. – alecxe Jun 10 '13 at 19:08
  • How do I check that? I have looked at the thread, but I can't see where it mentions how to test whether this is the case. – Mace Jun 10 '13 at 19:14
  • Well, I'm actually not sure how to know that exactly, other than running performance tests and measuring the download speed. Looks like it's not that trivial to do... – alecxe Jun 10 '13 at 19:29

Looking at your code, I'd say most of that time is spent on network requests rather than on processing the responses. All of the tips @alecxe gives in his answer apply, but I'd also suggest the HTTPCACHE_ENABLED setting, since it caches responses and avoids downloading them a second time. That helps on subsequent crawls and even makes offline development possible. See the docs for more info: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
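
A minimal sketch of enabling it in settings.py (the directory and expiration values are just examples):

    # settings.py -- turn on Scrapy's built-in HTTP cache
    HTTPCACHE_ENABLED = True
    HTTPCACHE_DIR = 'httpcache'        # example location for cached responses
    HTTPCACHE_EXPIRATION_SECS = 0      # 0 means cached responses never expire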

Capi Etheriel
  • Thanks, I will try that. I have tried several of the points @alecxe mentioned. In the beginning the scraping is very fast, but then it becomes quite slow, and I get failed scrapes because the requests take more than 180 seconds. Without knowing for sure, it seems like I'm either hitting the site too hard, or they slow down their replies because all the requests come from the same IP. Any thoughts on this? – Mace Jun 11 '13 at 18:54
  • @barraponto: Setting `HTTPCACHE_ENABLED` to `True` really helped!! Now my problem is that I get "500 Internal Server Error" a lot. I tried setting the download delay to 5 seconds and `CONCURRENT_REQUESTS_PER_DOMAIN = 2`, but it doesn't help. – Mace Jun 11 '13 at 20:09
  • @Mace A 500 can happen when the server crashes, which will probably be noticed by the maintainers and might lead them to block your IP or take measures to prevent automated access. I'd suggest using some proxies; maybe give Crawlera a try: http://crawlera.com/ – Capi Etheriel Jun 12 '13 at 21:13
  • Disclaimer: I work for ScrapingHub, the company that develops both Scrapy and Crawlera. – Capi Etheriel Jun 12 '13 at 21:13
  • @barraponto I think I did indeed cause a crash, but now it's up and running again and I am not blocked. I will just crawl less aggressively and I think I will be fine - today there have been no problems. – Mace Jun 12 '13 at 22:46

I also work on web scraping, using optimized C#, and it still ends up CPU bound, so I am switching to C.

Parsing HTML blows the CPU data cache, and I'm pretty sure your CPU is not using SSE 4.2 at all, since you can only access that feature from C/C++.

If you do the math, you are quickly compute bound, not memory bound.

Avlin

One workaround to speed up your Scrapy crawl is to configure your start_urls appropriately.

For example, if your target data lives at http://apps.webofknowledge.com/doc=1, where the doc number ranges from 1 to 1000, you can configure your start_urls as follows:

    start_urls = [
        "http://apps.webofknowledge.com/doc=250",
        "http://apps.webofknowledge.com/doc=750",
    ]

In this way, requests proceed from 250 onwards to 251, 249, etc., and from 750 onwards to 751, 749, etc., simultaneously, so you will crawl roughly 4 times faster compared to start_urls = ["http://apps.webofknowledge.com/doc=1"].
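
To generalize the idea, here is a small sketch that spreads start URLs evenly over a known id range; spread_start_urls is a hypothetical helper, not part of Scrapy, and it only helps when the spider follows links onward from each start page:

    # hypothetical helper: spread start URLs across a known doc-id range
    def spread_start_urls(url_template, first_id, last_id, fronts=2):
        # place each start URL near the middle of its chunk of the range
        step = (last_id - first_id + 1) // fronts
        return [url_template % (first_id + i * step + step // 2) for i in range(fronts)]

    # doc ids 1..1000 split into 2 fronts -> start URLs near doc=250 and doc=750
    start_urls = spread_start_urls("http://apps.webofknowledge.com/doc=%d", 1, 1000, fronts=2)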

zhaoqing