
I am crawling one website and parsing some content and images, but even for a simple site with about 100 pages it is taking hours to finish. I am using the following settings. Any help would be highly appreciated. I have already seen this question - Scrapy's Scrapyd too slow with scheduling spiders - but couldn't gather much insight from it.

EXTENSIONS = {'scrapy.contrib.logstats.LogStats': 1}  # periodic crawl stats in the log
LOGSTATS_INTERVAL = 60.0                              # log stats every 60 seconds
RETRY_TIMES = 4
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 12
CONCURRENT_ITEMS = 200
DOWNLOAD_DELAY = 0.75                                 # fixed pause between requests
Pradeep Kumar Mishra

1 Answer


Are you sure the website is responding OK?

Setting DOWNLOAD_DELAY = 0.75 forces requests to the same domain to be sequential and adds a delay of 0.75 seconds between them. Your crawl will certainly be faster if you remove this; however, with 12 concurrent requests per domain, be careful you are not hitting websites too aggressively.
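
As a rough sketch (not a drop-in fix; the AutoThrottle values below are illustrative assumptions, not tuned numbers), you could drop the fixed delay and let Scrapy's built-in AutoThrottle extension adapt the pacing to the server's response times instead:

# Minimal settings.py sketch: fixed DOWNLOAD_DELAY removed,
# AutoThrottle enabled so the delay adapts to server latency.
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 12
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0   # initial delay; adjusted automatically
AUTOTHROTTLE_MAX_DELAY = 10.0    # upper bound when the server is slow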

Even with the delay it should not take hours, which is why I am wondering whether the website is slow or unresponsive. Some websites will do this to bots.
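
One quick way to check is to time a single page fetch outside Scrapy. A minimal sketch using only the Python standard library (the URL is a placeholder; substitute a page from your crawl):

import time
import urllib.request

url = 'http://example.com/some-page'  # placeholder: use a real page from the crawl
start = time.time()
with urllib.request.urlopen(url) as response:
    body = response.read()
print('fetched %d bytes in %.2f seconds' % (len(body), time.time() - start))

If a single fetch already takes several seconds, the bottleneck is the site or the network rather than Scrapy.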

Shane Evans
  • Well, it's happening with all the websites, so I am wondering whether the Scrapy architecture is scalable enough for such work. By the way, what rate should be expected under standard conditions, e.g. N pages per hour? – Pradeep Kumar Mishra Aug 16 '12 at 04:38
  • Usually you'd crawl a few hundred pages in seconds, if your bot is not network bound. The problem is certainly not with the Scrapy architecture. It's more likely something else: settings, your hardware/network, the sites being crawled, your spider code, etc. – Shane Evans Aug 23 '12 at 09:29
  • 2
    Shane, when you say " DOWNLOAD_DELAY = 0.75 will force requests to be sequential.." do you mean the `CONCURRENT_REQUESTS` setting will be ignored? – Alexander Suraphel Mar 27 '16 at 09:24
  • No, setting DOWNLOAD_DELAY **won't** cause CONCURRENT_REQUESTS to be ignored – Done Data Solutions Apr 26 '17 at 08:13