
What is the better method of scaling Scrapy?

  1. By running a single Scrapy process and increasing Scrapy's internal CONCURRENT_REQUESTS setting
  2. By running multiple Scrapy processes, but still focusing on increasing the internal setting
  3. By increasing the number of Scrapy processes while keeping the internal setting at some constant value

If option 3, then what software is best for launching multiple Scrapy processes?

And what is the best way to distribute Scrapy across multiple servers?

– Gill Bates

3 Answers


Scrapyd is a great tool for managing Scrapy processes. But the best answer I can give is that it depends. First you need to figure out where your bottleneck is.

If it is CPU-intensive parsing, you should use multiple processes. Scrapy can handle thousands of requests in parallel through Twisted's implementation of the reactor pattern, but it uses a single process with no multi-threading, so it will only ever utilize one core.
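
For that case, a minimal sketch (not from this answer) is a small driver script that launches several independent `scrapy crawl` processes. The spider name and the shard arguments used to split the work between processes are placeholder assumptions:

```python
# Minimal sketch: launch one Scrapy process per core so parsing is not
# limited to a single CPU. "myspider" and the shard/num_shards arguments
# are hypothetical -- the spider would have to use them to partition its
# start URLs.
import subprocess

NUM_PROCS = 4  # e.g. one process per core

procs = [
    subprocess.Popen([
        "scrapy", "crawl", "myspider",
        "-a", f"shard={i}",              # which slice of the URL space to crawl
        "-a", f"num_shards={NUM_PROCS}",
        "-s", "CONCURRENT_REQUESTS=32",  # per-process request concurrency
    ])
    for i in range(NUM_PROCS)
]

for proc in procs:
    proc.wait()
```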

If it is just the number of requests that is limiting speed, tweak concurrent requests. Test your internet speed to find out how much bandwidth you have. Then open the network panel of your system monitor, run your spider, and compare the bandwidth you use against your maximum. Increase your concurrent requests until you stop seeing performance gains. The stopping point could be determined by the site's capacity (though only for small sites), the site's anti-scraping/DDoS protections (assuming you don't have proxies or VPNs), your bandwidth, or another chokepoint in the system.

The last thing to know is that, while requests are handled asynchronously, items are not. If you have a lot of text and write everything locally, the writes will block requests and you will see lulls in the system monitor's network panel. You can tweak your concurrent items and maybe get smoother network usage, but it will still take the same amount of time. If you are writing to a database, consider an insert delayed, a queue flushed with an execute-many once it passes a threshold, or both. Here is a pipeline someone wrote to handle all db writes asynchronously.

The last chokepoint could be memory. I have run into this issue on an AWS micro instance, though on a laptop it probably isn't an issue. If you don't need them, consider disabling the cache, cookies, and dupefilter; of course, they can be very helpful. Concurrent items and requests also take up memory.
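
The linked pipeline isn't reproduced here, but a rough sketch of the same idea, using Twisted's adbapi connection pool so database writes don't block the reactor, might look like this (database, table, columns, and credentials are placeholders):

```python
# Rough sketch (not the linked pipeline): push item writes through Twisted's
# adbapi thread-pool so the reactor is never blocked. Database name, table,
# columns and credentials are placeholders.
from twisted.enterprise import adbapi


class AsyncMySQLPipeline:
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_crawler(cls, crawler):
        # Credentials would normally come from crawler.settings.
        dbpool = adbapi.ConnectionPool(
            "MySQLdb",            # or "pymysql"
            host="localhost",
            db="scraping",
            user="scrapy",
            passwd="secret",
            charset="utf8mb4",
            cp_reconnect=True,
        )
        return cls(dbpool)

    def process_item(self, item, spider):
        # runInteraction runs the insert on a pooled connection in a thread
        # and returns a Deferred, so the crawl keeps downloading meanwhile.
        d = self.dbpool.runInteraction(self._insert, item)
        d.addErrback(self._handle_error, item, spider)
        d.addCallback(lambda _: item)   # hand the item on to the next pipeline
        return d

    def _insert(self, cursor, item):
        cursor.execute(
            "INSERT INTO pages (url, title) VALUES (%s, %s)",
            (item.get("url"), item.get("title")),
        )

    def _handle_error(self, failure, item, spider):
        spider.logger.error("DB write failed for %r: %s", item, failure)
```

It would be enabled like any other pipeline, via ITEM_PIPELINES in settings.py.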

– Will Madaus

  • https://github.com/aivarsk/scrapy-proxies If you find yourself getting blocked, the above will randomly choose proxies from a list you feed it. Proxies will be slower than your network, but might be faster than the precautions you would have to take to avoid being blocked. – Will Madaus Feb 02 '16 at 15:46
  • regarding proxies, you could just use a load balancer with round robin. – Daniel Dror Mar 27 '17 at 05:00

Scrapyd was made exactly for deploying and running Scrapy spiders. Basically, it is a daemon that listens for requests to run spiders. Scrapyd runs spiders in multiple processes; you can control the behavior with the max_proc and max_proc_per_cpu settings:

max_proc

The maximum number of concurrent Scrapy processes that will be started. If unset or 0, it will use the number of CPUs available in the system multiplied by the value of the max_proc_per_cpu option. Defaults to 0.

max_proc_per_cpu

The maximum number of concurrent Scrapy processes that will be started per CPU. Defaults to 4.

It has a nice JSON API and provides a convenient way to deploy Scrapy projects to scrapyd.
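
As a small illustration of the JSON API, scheduling and monitoring runs can be done with a few HTTP calls; the host, project, and spider names below are placeholders, and the project is assumed to be already deployed (e.g. with scrapyd-client):

```python
# Sketch of driving scrapyd's JSON API with the requests library.
# Assumes scrapyd is listening on localhost:6800 and a project named
# "myproject" with a spider "myspider" has already been deployed.
import requests

SCRAPYD = "http://localhost:6800"

# Schedule a spider run; scrapyd queues it and starts it when a process
# slot (bounded by max_proc / max_proc_per_cpu) is free.
resp = requests.post(
    f"{SCRAPYD}/schedule.json",
    data={"project": "myproject", "spider": "myspider"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}

# List pending, running and finished jobs for the project.
jobs = requests.get(f"{SCRAPYD}/listjobs.json", params={"project": "myproject"})
print(jobs.json())
```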

Another option would be to use a different service, like Scrapy Cloud:

Scrapy Cloud bridges the highly efficient Scrapy development environment with a robust, fully-featured production environment to deploy and run your crawls. It's like a Heroku for Scrapy, although other technologies will be supported in the near future. It runs on top of the Scrapinghub platform, which means your project can scale on demand, as needed.

– alecxe

This might not be exactly one of your predefined choices, but for concurrency and delay management you can improve your overall configuration by removing all of the hard limits from your internal settings and letting the AutoThrottle extension handle that for you.

It will adjust your configuration according to the average latency of the domains you request and your own ability to crawl at that speed. Adding a new domain also becomes easier, since you don't have to worry about how to tweak your configuration for that domain.
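
As a hedged illustration, enabling it takes only a handful of settings in settings.py; the numbers below are the usual starting points, not recommendations:

```python
# settings.py -- let AutoThrottle adapt the delay instead of hard limits.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # upper bound when a site responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average requests in flight per remote site
AUTOTHROTTLE_DEBUG = False             # set True to log every adjusted delay

# CONCURRENT_REQUESTS* can stay at generous values; AutoThrottle keeps the
# effective rate below them based on observed latency.
```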

I tried it on a project and the results were very interesting. There wasn't a huge performance drop, but reliability improved. Most of all, it simplified everything a lot and reduced the risk of a crawl failing due to throttling or overload, which was a concern in that project's situation.

I know this question is old, but I hope this will help someone looking for reliability as well.

– Frederik.L