6

I'm trying to build a system to run a few dozen Scrapy spiders, save the results to S3, and let me know when it finishes. There are several similar questions on StackOverflow (e.g. this one and this other one), but they all seem to use the same recommendation (from the Scrapy docs): set up a CrawlerProcess, add the spiders to it, and hit start().
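
Roughly, what I tried looks like this (just a sketch; it assumes the script runs from inside my Scrapy project so the spider loader can find all the spiders):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
process = CrawlerProcess(settings)

# Add every spider in the project to the same process, then start them all.
for spider_name in process.spider_loader.list():
    process.crawl(spider_name)

process.start()  # blocks until all spiders finish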

When I try this method with all 325 of my spiders, though, it eventually locks up and fails because it attempts to open too many file descriptors on the system that runs it. I've tried a few things that haven't worked.

What is the recommended way to run a large number of spiders with Scrapy?

Edited to add: I understand I can scale up to multiple machines and pay for services to help coordinate (e.g. ScrapingHub), but I'd prefer to run this on one machine using some sort of process pool + queue so that only a small fixed number of spiders are ever running at the same time.
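
In other words, something along these lines is the shape I'm after (again just a sketch; the pool size of 12 and the final upload/notification step are placeholders):

import subprocess
from multiprocessing import Pool

def run_spider(name):
    # Each worker shells out to `scrapy crawl` for a single spider.
    subprocess.run(["scrapy", "crawl", name], check=True)

if __name__ == "__main__":
    spiders = subprocess.run(
        ["scrapy", "list"], capture_output=True, text=True, check=True
    ).stdout.split()

    with Pool(processes=12) as pool:  # never more than 12 spiders running at once
        pool.map(run_spider, spiders)

    # upload the results to S3 and send the "done" notification here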

magneticMonster

3 Answers

12

The simplest way to do this is to run them all from the command line. For example:

$ scrapy list | xargs -P 4 -n 1 scrapy crawl

This will run all your spiders, with up to 4 running in parallel at any time. You can then send a notification in a script once this command completes.
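
For example, a minimal wrapper might look like this (the final step is just a placeholder for whatever notification mechanism you use):

import subprocess

# Same pipeline as above: run every spider, at most 4 at a time.
subprocess.run("scrapy list | xargs -P 4 -n 1 scrapy crawl", shell=True, check=True)

# This line only runs once every crawl has exited; swap in email, Slack, SNS, etc.
print("All crawls finished")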

A more robust option is to use scrapyd. This comes with an API, a minimal web interface, etc. It will also queue the crawls and only run a certain (configurable) number at once. You can interact with it via the API to start your spiders and send notifications once they are all complete.
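
For instance, a rough sketch against scrapyd's JSON API (assuming scrapyd is running on its default port and your project is deployed as "myproject", which is a placeholder name; it needs the requests package):

import time
import requests

SCRAPYD = "http://localhost:6800"  # default scrapyd address
PROJECT = "myproject"              # placeholder project name

# Schedule every spider; scrapyd queues them and runs a configurable number at once.
spiders = requests.get(f"{SCRAPYD}/listspiders.json", params={"project": PROJECT}).json()["spiders"]
for spider in spiders:
    requests.post(f"{SCRAPYD}/schedule.json", data={"project": PROJECT, "spider": spider})

# Poll until nothing is pending or running any more, then notify.
while True:
    jobs = requests.get(f"{SCRAPYD}/listjobs.json", params={"project": PROJECT}).json()
    if not jobs["pending"] and not jobs["running"]:
        break
    time.sleep(30)

print("all crawls finished")  # replace with your notification of choice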

Scrapy Cloud is a perfect fit for this [disclaimer: I work for Scrapinghub]. It will run only a certain number of spiders at once, has a queue of pending jobs (which you can modify, browse online, prioritize, etc.), and offers a more complete API than scrapyd.

You shouldn't run all your spiders in a single process. It will probably be slower, it can introduce unforeseen bugs, and you may hit resource limits (like you did). If you run them separately using any of the options above, just run enough to max out your hardware resources (usually CPU/network). If you still run into file descriptor problems at that point, you should increase the limit.
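
If it comes to that, on Linux/macOS you can raise the soft limit up to the hard limit from the launching process (raising the hard limit itself needs root or a change to your system's limits configuration); a minimal sketch:

import resource

# Raise this process's open-file soft limit to the configured hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))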

Shane Evans
  • Thanks for the suggestions. I tried using scrapyd earlier, but getting it set up and configured, adding the spiders, scheduling them all to run, waiting for them to finish, getting the results in my desired format, and shutting the whole thing down became extremely complicated. I also tried Scrapinghub, but it is similarly complicated to run and is rather expensive. It doesn't sound like Scrapy supports this kind of thing, so I'll stick with the xargs path for now. – magneticMonster Jan 04 '18 at 16:46
  • Absolutely nothing wrong with xargs! The others are more complicated, so they're only worth it if you can use the extra functionality. Scrapy Cloud starts at $9 per month, but the cost will depend on how many spiders you want to run at once. – Shane Evans Jan 04 '18 at 17:26
  • "I also tried Scrapinghub, but it is similarly complicated to run" -- What part do you find complicated? Creating the account, creating the project, deploying your code, or running your spiders via the webapp or API? "...and is rather expensive" -- I would like to know what hosting alternatives you are considering, if it's not too much to ask. Do you plan to run your spiders periodically or just once? – dangra Jan 05 '18 at 03:21
  • I'm trying to build a process that runs all the spiders on an EC2 instance, concatenates their output and uploads it to S3 for storage, then shuts down the instance. To get that to happen with scrapyd or Scrapinghub I have to install scrapyd, install the scrapyd CLI, build and deploy the egg, schedule all my spiders one by one, poll to detect when they finish, download the output from scrapyd and convert it to my format, and upload it to S3. I might as well just run scrapy crawl with xargs. – magneticMonster Jan 06 '18 at 04:23
  • Using EC2 directly I can use a bigger instance and pay for spot pricing only when I need it. For the $9/month cost of two threads on scrapinghub, I can run a 2-core machine in EC2 for 300 hours and get ~20 runs of all my spiders done. In practice I only want to do it 5 or 6 times per month, so it's significantly cheaper in my case and I'll get results faster and in the format I want more easily. – magneticMonster Jan 06 '18 at 04:33
  • Where in the docs is the command line solution, so that I can understand what's being done here? It's the simplest answer to this question I've found. I don't find anything in the docs about the -P or -n command line parameters, unless they are shorthand for something else. Thank you for posting this answer. – NFB Feb 04 '18 at 15:32
  • This isn't in the Scrapy docs. The -n and -P arguments are arguments to xargs. The `scrapy list` command generates a list of spiders, which is piped to xargs. xargs then calls `scrapy crawl` for the spiders: `-n 1` tells xargs to pass only one spider name to each scrapy crawl invocation, and `-P 4` allows it to run up to 4 processes concurrently. – Shane Evans Feb 07 '18 at 21:15
1

it eventually locks up and fails because it attempts to open too many file descriptors on the system that runs it

That's probably a sign that you need multiple machines to execute your spiders; it's a scalability issue. You can also scale vertically to make your single machine more powerful, but you would hit such a "limit" much faster that way:

Check out the Distributed Crawling documentation and the scrapyd project.

There is also a cloud-based distributed crawling service called ScrapingHub, which would take the scalability problems away from you altogether (note that I am not advertising them, as I have no affiliation with the company).

alecxe
  • I don't want to run all the spiders at once – I want to run them in a pool of (say) 12 at a time until they finish. I'm happy to wait hours for all of these to finish, but I'd like for them to run on one machine. – magneticMonster Jan 04 '18 at 05:22
  • @magneticMonster you can use `scrapyd` for that as well, there is scheduling there. – alecxe Jan 04 '18 at 06:21
0

One solution, if the information is relatively static (based on your mention of the process "finishing"), is simply to set up a script that runs the crawls sequentially or in batches: wait for one to finish before starting the next one (or the next 10, or whatever your batch size is).
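
A rough sketch of that approach (the batch size of 10 is arbitrary, and each crawl is just a scrapy crawl subprocess):

import subprocess

BATCH_SIZE = 10  # arbitrary; tune to what your machine can handle

spiders = subprocess.run(
    ["scrapy", "list"], capture_output=True, text=True, check=True
).stdout.split()

for i in range(0, len(spiders), BATCH_SIZE):
    batch = spiders[i:i + BATCH_SIZE]
    # Start one `scrapy crawl` process per spider in this batch...
    procs = [subprocess.Popen(["scrapy", "crawl", name]) for name in batch]
    # ...and wait for the whole batch to finish before starting the next one.
    for proc in procs:
        proc.wait()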

Another thing to consider if you're only using one machine and this error is cropping up: having too many files open isn't really a resource bottleneck, it's an artificial limit. You might be better off having each spider handle 200 or so concurrent requests so that network I/O (typically, though sometimes CPU) becomes the real bottleneck. Each spider will then finish faster on average than with your current approach, which launches them all at once and hits a "maximum file descriptor" limit rather than an actual resource limit.
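
In Scrapy, per-spider concurrency is controlled through settings rather than threads you spawn yourself; for example, something like this in the project's settings.py (the numbers are only illustrative):

# settings.py -- illustrative values, tune for your hardware and target sites
CONCURRENT_REQUESTS = 200            # total concurrent requests per spider process
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # keep per-domain load reasonable
REACTOR_THREADPOOL_MAXSIZE = 20      # thread pool used for DNS resolution, etc.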

Hans Musgrave