
I have an actively developed Scrapy project of 70+ spiders that I want to run regularly (daily/weekly), with about 4 spiders running concurrently, and with the ability to stop, start, and monitor the spiders individually.

I've ruled out Scrapyd because it adds layers of complexity on top of what I already have and requires deploying a new egg for even a small change to any part of the project.

There's a more general version of this question here with an accepted answer using the command line:

$ scrapy list | xargs -P 4 -n 1 scrapy crawl

This is a pretty good solution, but it requires working through the bash shell, which feels clumsy, and there is no way to stop or interact with the spiders programmatically once they are running. I want to keep things as simple as possible, preferably without leaving my Python IDE. I also need to be able to stop, add, or remove a spider from this queue without doing so in separate bash windows or starting over entirely.
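For what it's worth, the closest I can get without leaving Python seems to be shelling out to those same commands. A rough sketch (nothing here beyond the standard library and the scrapy CLI; the [:4] slice is only to illustrate the point):

import subprocess

# Ask the scrapy CLI for the spider names, same as `scrapy list` in the shell.
spider_names = subprocess.run(
    ["scrapy", "list"], capture_output=True, text=True, check=True
).stdout.split()

# Launch a few spiders as separate OS processes. The only "control" this
# gives me is proc.terminate()/proc.kill() on whole processes, which is
# exactly the coarse, non-programmatic handling I'm trying to avoid.
procs = [subprocess.Popen(["scrapy", "crawl", name]) for name in spider_names[:4]]
for proc in procs:
    proc.wait()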

I'm puzzled that there is no clear way to go about this in the docs. Looking at similar but not identical questions here, here, and here, the answers mostly point either to Scrapyd or to spinning up separate Twisted reactors, which seems complicated and potentially error-prone.

Using CrawlerProcess, I know I can do something like:

from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
spider_loader = SpiderLoader.from_settings(settings)
spider_names = spider_loader.list()
process = CrawlerProcess(settings)

def run_all_spiders():
    for name in spider_names:
        process.crawl(name)
    process.start()  # blocks until every scheduled crawl has finished

run_all_spiders()

But this appears to start all the spiders at once the moment process.start() is called, and it blocks any further code until every crawl has completed. Using only Python, is there a way to enqueue many spiders so that only a certain number run at once, while still keeping control over them?
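To make the goal concrete, here is roughly the kind of thing I'm imagining, sketched with CrawlerRunner and Twisted's DeferredSemaphore (the value 4 is just my target concurrency; I'm not at all sure this is the right approach, which is part of the question):

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.spiderloader import SpiderLoader
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
configure_logging(settings)

spider_names = SpiderLoader.from_settings(settings).list()
runner = CrawlerRunner(settings)
semaphore = defer.DeferredSemaphore(4)  # run at most 4 spiders at a time

# Each crawl waits for a free semaphore slot before starting; the slot is
# released when that spider's deferred fires, letting the next one begin.
crawls = [semaphore.run(runner.crawl, name) for name in spider_names]

# Shut the reactor down once every queued crawl has finished.
defer.DeferredList(crawls).addBoth(lambda _: reactor.stop())
reactor.run()

Even if something like this works, I still don't see how to cleanly stop one running spider or add a new one to the queue afterwards, which is what I mean by "control".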

Scrapyd hasn't seen any improvements in the past five years, which is around when the company behind Scrapy launched their paid crawler management system that includes many of the features you're looking for. You'll have to build something yourself with Celery. I was using a simple system built with Celery and [Flower](https://github.com/mher/flower) for a while but don't have the code for it anymore. There's also [django-dynamic-scraper](https://django-dynamic-scraper.readthedocs.io/en/latest/index.html), which you might find useful. – Blender Feb 06 '18 at 23:35
