4

I have several different spiders and want to run all of them at once. Based on this and this, I can run multiple spiders in the same process. However, I don't know how to design a signal system that stops the reactor when all spiders are finished.

I have tried:

crawler.signals.connect(reactor.stop, signal=signals.spider_closed)

and

crawler.signals.connect(reactor.stop, signal=signals.spider_idle)

In both cases, the reactor stops when the first crawler closes. Of course, I want the reactor to stop only after all spiders have finished.

Could someone show me how to do this?

  • Why not use `scrapyd`? It is designed for this. – agstudy Apr 03 '14 at 13:55
  • I would say that, in my case, using `scrapyd` would be "killing an ant with a cannonball". I just need to run a bunch of spiders together; `scrapyd` does a lot more than that and adds a layer of software that I don't need. –  Apr 04 '14 at 00:04

1 Answer

7

After sleeping on it, I realized how to do it. All I need is a counter:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

class ReactorControl:

    def __init__(self):
        self.crawlers_running = 0

    def add_crawler(self):
        self.crawlers_running += 1

    def remove_crawler(self):
        self.crawlers_running -= 1
        # Stop the reactor once the last spider has closed.
        if self.crawlers_running == 0:
            reactor.stop()

def setup_crawler(spider_name):
    crawler = Crawler(settings)
    crawler.configure()
    # Decrement the counter whenever this spider closes.
    crawler.signals.connect(reactor_control.remove_crawler, signal=signals.spider_closed)
    spider = crawler.spiders.create(spider_name)
    crawler.crawl(spider)
    reactor_control.add_crawler()
    crawler.start()

reactor_control = ReactorControl()
log.start()
settings = get_project_settings()
# This Crawler instance is only used to list the spiders in the project.
crawler = Crawler(settings)

for spider_name in crawler.spiders.list():
    setup_crawler(spider_name)

reactor.run()

I am assuming Scrapy is not parallel, i.e. everything runs in a single thread, so the counter does not need any locking.

I don't know if it is the best way to do that, but it works!

Edit: Updated. See @Jean-Robert comment.
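
For reference, newer Scrapy versions expose the same "wait for everything, then stop" idea through `CrawlerRunner`: each `crawl()` call returns a Deferred, and `join()` fires only once every scheduled crawl has finished, so the manual counter goes away. A minimal sketch, assuming the newer `CrawlerRunner` / `spider_loader` / `configure_logging` API rather than the old `Crawler` API used above:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)

# Schedule every spider registered in the project; each call returns a Deferred.
for spider_name in runner.spider_loader.list():
    runner.crawl(spider_name)

# join() returns a Deferred that fires once all scheduled crawls are done.
runner.join().addBoth(lambda _: reactor.stop())
reactor.run()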

  • 2
    Nice solution! One thing though: because of scrapy's asynchronous behavior, you might end up with a `spider_closed` signal being triggered before you've added the crawler to `ReactorControl` (maybe if there is an exception very early in the process), in which case the count is wrong. Probably moving `add_crawler` up one line would do. But that's an extreme case... – Jean-Robert Sep 30 '14 at 15:46
  • If I start 3 spiders, how do they proceed in the same process? What do they share? – Shuai Zhang Jan 29 '15 at 14:06
  • Just a guess... You can consider each spider as an independent function, and none of the spiders will interfere with the other two. Again, I have assumed that Scrapy runs as a single thread, so there is just one process and one thread at any time. (But I could be completely wrong.) –  Jan 30 '15 at 20:10
  • It doesn't work properly when using JOBDIR, because spider resources are not properly released. – Toilal May 02 '15 at 13:19