
There are several posts about how to set up Scrapy within a Celery task while avoiding restarting the Twisted reactor, to prevent the twisted.internet.error.ReactorNotRestartable error. I have tried using CrawlerRunner as recommended in the docs, together with crochet, but to make it work the following lines have to be removed from the code:

d.addBoth(lambda _: reactor.stop())
reactor.run() # Script blocks here until spider finishes.
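
For reference, the CrawlerRunner example in the Scrapy docs that these lines come from looks roughly like this (MySpider is just a placeholder):

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # spider definition goes here
    ...

configure_logging()
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script blocks here until the crawling is finished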

This is the full code:

import json

from django.core.serializers.json import DjangoJSONEncoder
from scrapy.crawler import CrawlerRunner

@app.task(bind=True, name="run_spider")
def run_spider(self, my_spider_class, my_spider_settings):
    from crochet import setup
    setup()  # Fixes the issue with Twisted reactor.
    runner = CrawlerRunner(my_spider_settings)
    crawler = runner.create_crawler(my_spider_class)
    runner.crawl(crawler)
    d = runner.join()

    def spider_finished(_):  # This function is called when the spider finishes.
        logger.info("{} finished:\n{}".format(
            spider_name,
            json.dumps(
                crawler.stats.spider_stats.get(spider_name, {}),
                cls=DjangoJSONEncoder,
                indent=4
            )
        ))

    d.addBoth(spider_finished)
    return f"{spider_name} started"  # How do I block execution until the spider finishes?

This now works, but the spider seems to run detached from the task, so I get the return value before the spider has finished. I have used a workaround with the spider_finished() callback, but it is not ideal because the Celery worker keeps running and executing other tasks, and eventually kills the process, affecting the detached spiders.

Is there a way to block the execution of the task until the Scrapy spider is done?

Ander

1 Answer


From the docs:

It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.

Celery does not use a reactor, so you can't use CrawlerRunner.

Run a Scrapy spider in a Celery Task
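
A minimal sketch of that approach (not the exact code from the linked question; it assumes app is your Celery application and that the spider class and settings you pass are picklable): run CrawlerProcess in a separate process via billiard, so every task gets a fresh Twisted reactor, and block on join() until the crawl is done.

from billiard import Process
from scrapy.crawler import CrawlerProcess

def _crawl(spider_class, spider_settings):
    # Runs in a child process, so a brand-new Twisted reactor is used each time.
    process = CrawlerProcess(spider_settings)
    process.crawl(spider_class)
    process.start()  # blocks until the crawl is finished

@app.task(bind=True, name="run_spider")
def run_spider(self, my_spider_class, my_spider_settings):
    p = Process(target=_crawl, args=(my_spider_class, my_spider_settings))
    p.start()
    p.join()  # blocks the Celery task until the spider has finished
    return f"{my_spider_class.name} finished"

billiard is Celery's fork of multiprocessing and is typically used here because the worker's task processes are daemonic and cannot spawn multiprocessing children.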

kjaw