
I am experiencing slow crawl speeds with Scrapy (around 1 page/sec). I'm crawling a major website from AWS servers, so I don't think it's a network issue. CPU utilization is nowhere near 100%, and if I start multiple Scrapy processes, the overall crawl speed is much faster.

Scrapy seems to crawl a bunch of pages, then hangs for several seconds, and then repeats.

I've tried playing with: CONCURRENT_REQUESTS = CONCURRENT_REQUESTS_PER_DOMAIN = 500

but this doesn't really seem to move the needle past about 20.
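For reference, here is the kind of settings sketch being described. The setting names are standard Scrapy settings; the values are the ones from the question and are illustrative, not a recommendation:

```python
# settings.py -- illustrative values from the question, not a recommendation
CONCURRENT_REQUESTS = 500             # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 500  # per-domain cap; the lower of the two wins
DOWNLOAD_DELAY = 0                    # ensure no artificial per-request delay
```

Note that since the whole crawl targets a single site, `CONCURRENT_REQUESTS_PER_DOMAIN` is the binding limit here.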

somewire
  • Which Scrapy version? Any non-default extensions/middleware? The pauses could be blocking code: could you be doing something (e.g. writing data to a DB, uploading to S3, etc.) in the reactor thread that is blocking Scrapy? – Shane Evans Nov 22 '12 at 09:21
  • @somewire Check CPU/disk/network utilisation with just scraping, without parsing the page with lxml. Set `LOG_LEVEL = 'DEBUG'` – b1_ Feb 23 '13 at 07:59
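The blocking-code hypothesis in the comments above fits the symptoms, and can be quantified: Scrapy runs all callbacks in a single reactor thread, so if each response spends t seconds in blocking parse or I/O work, throughput is capped at 1/t pages per second no matter how high the concurrency settings go. A back-of-envelope sketch (the 50 ms figure is a hypothetical assumption, chosen because it reproduces the observed ~20 pages/sec ceiling):

```python
def max_throughput(blocking_seconds_per_response: float) -> float:
    """Upper bound on pages/sec when every response blocks the
    single reactor thread for the given duration."""
    return 1.0 / blocking_seconds_per_response

# A hypothetical 50 ms of blocking work per page caps the crawl at
# 20 pages/sec, regardless of CONCURRENT_REQUESTS.
print(max_throughput(0.05))
```

This would also explain the observed pattern of crawling a burst of pages and then hanging: the downloader fills its concurrency window while the reactor thread drains the backlog of blocked callbacks.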

1 Answer


Are you sure you are allowed to crawl the destination site at high speed? Many sites implement download-rate thresholds and, after a while, start responding slowly.
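If server-side throttling is the cause, one gentler configuration sketch looks like this. These are standard Scrapy settings (AutoThrottle is a built-in extension); the values are illustrative assumptions, to be tuned against the target site:

```python
# settings.py -- back off automatically instead of hammering the site
AUTOTHROTTLE_ENABLED = True             # adapt delay to observed server latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0   # average parallel requests to aim for
DOWNLOAD_DELAY = 0.25                   # baseline politeness delay (seconds)
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]  # retry throttled responses
```

Watching the crawl logs for 503s (or rising response times) is the quickest way to confirm whether the site is rate-limiting you.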

gvtech
  • You are right: if requests return 503 and the frontend server is nginx, see http://nginx.org/en/docs/http/ngx_http_limit_conn_module.html – b1_ Feb 23 '13 at 08:03