There are several posts about how to set up Scrapy within a Celery task without restarting the Twisted reactor, to prevent the twisted.internet.error.ReactorNotRestartable
error. I have tried using CrawlerRunner, as recommended in the docs, together with crochet, but to make it work the following lines have to be removed from the code:
d.addBoth(lambda _: reactor.stop())
reactor.run() # Script blocks here until spider finishes.
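For reference, the docs example I started from looks roughly like this (MySpider stands in for my own spider class):

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Spider definition goes here.
    ...

configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # The script blocks here until the spider finishes.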
This is the full code:
import json
import logging

from django.core.serializers.json import DjangoJSONEncoder
from scrapy.crawler import CrawlerRunner

logger = logging.getLogger(__name__)

@app.task(bind=True, name="run_spider")
def run_spider(self, my_spider_class, my_spider_settings):
    from crochet import setup
    setup()  # Runs the Twisted reactor in a separate thread; fixes ReactorNotRestartable.

    runner = CrawlerRunner(my_spider_settings)
    crawler = runner.create_crawler(my_spider_class)
    runner.crawl(crawler)
    d = runner.join()

    spider_name = my_spider_class.name

    def spider_finished(_):
        # This callback is called when the spider finishes.
        logger.info("{} finished:\n{}".format(
            spider_name,
            json.dumps(
                crawler.stats.spider_stats.get(spider_name, {}),
                cls=DjangoJSONEncoder,
                indent=4
            )
        ))

    d.addBoth(spider_finished)

    # How can I block execution here until the spider finishes?
    return f"{spider_name} started"
This now works, but the spider seems to run detached from the task, so I get the return value before the spider has finished. I have used a workaround with the spider_finished() callback, but it is not ideal because the Celery worker keeps running and executing other tasks, and eventually kills the process, affecting the detached spiders.
Is there a way to block the execution of the task until the Scrapy spider is done?
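For what it's worth, one direction I have been looking at is crochet's wait_for decorator, which is supposed to block the calling thread until the Deferred returned by the wrapped function fires. A rough sketch of what I have in mind (the helper name and timeout are mine, and I am not sure this is safe inside a Celery worker):

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner

setup()

@wait_for(timeout=3600)  # Blocks the caller; raises crochet.TimeoutError after an hour.
def _crawl_blocking(my_spider_class, my_spider_settings):
    # Runs in the reactor thread; returning the Deferred makes wait_for block on it.
    runner = CrawlerRunner(my_spider_settings)
    runner.crawl(my_spider_class)
    return runner.join()

@app.task(bind=True, name="run_spider")
def run_spider(self, my_spider_class, my_spider_settings):
    _crawl_blocking(my_spider_class, my_spider_settings)
    return f"{my_spider_class.name} finished"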