I am trying to send multiple crawl requests with Google Cloud Functions. However, I keep getting a ReactorNotRestartable error. From other posts on StackOverflow, such as this one, I understand that this happens because the Twisted reactor cannot be restarted, in particular when it is started inside a loop.
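For reference, a minimal sketch of the failing pattern (the spider below is just an illustrative placeholder, not my real one):

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Placeholder spider, only for illustration
    name = "my_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"url": response.url}

# Starting the reactor inside a loop: the first start() runs fine,
# but the second raises twisted.internet.error.ReactorNotRestartable,
# because Twisted's reactor cannot be started again once it has stopped.
for _ in range(2):
    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start()  # stops the reactor when the crawl finishes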
The usual way to solve this is to put the start() call outside the for loop. However, with Cloud Functions this is not possible, as each request should be technically independent.
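In a regular long-running script that fix would look roughly like this (reusing the illustrative MySpider from the sketch above):

from scrapy.crawler import CrawlerProcess

# Queue all crawls first, then start the reactor exactly once.
process = CrawlerProcess()
for _ in range(2):
    process.crawl(MySpider)
process.start()  # a single reactor start for all queued crawls

But with Cloud Functions each crawl arrives as a separate invocation, so there is no loop to hoist the start() out of.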
Is the CrawlerProcess somehow cached by Cloud Functions? And if so, how can we remove this behaviour?
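My understanding (an assumption on my part; the handler name below is purely illustrative) is that Cloud Functions can reuse the same Python process for consecutive invocations, so anything that lives at module level, such as Twisted's global reactor, survives between requests even when CrawlerProcess itself is recreated:

# Illustrative sketch of a warm instance reusing module-level state
invocation_count = 0  # module-level, persists while the instance stays warm

def handler(data, context):
    global invocation_count
    invocation_count += 1
    # On a warm instance this prints 1, 2, 3, ... across separate requests,
    # so a reactor stopped during request 1 is still stopped in request 2.
    print(f"invocation {invocation_count}")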
I tried, for instance, putting the import and initialization inside the function rather than at module level, to prevent the caching of imports, but that did not work:
# main.py
def run_single_crawl(data, context):
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    process.crawl(MySpider)  # MySpider is defined/imported elsewhere in my project
    process.start()