
I am trying to send multiple crawl requests with Google Cloud Functions. However, I keep getting the ReactorNotRestartable error. From other posts on Stack Overflow, such as this one, I understand that this happens because the Twisted reactor cannot be restarted, in particular when crawling in a loop.

The usual way to solve this is to call start() once, outside the for loop. However, with Cloud Functions this is not possible, as each request is supposed to be technically independent.
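For reference, the usual workaround looks roughly like this (a sketch, assuming several spider classes such as MySpider1 and MySpider2 are already defined):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
for spider_cls in (MySpider1, MySpider2):
    process.crawl(spider_cls)

# start() is called only once, outside the loop, and blocks until all crawls finish.
process.start()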

Is the CrawlerProcess somehow cached by Cloud Functions? And if so, how can I avoid this behaviour?

I tried, for instance, putting the import and initialization inside the function instead of at module level, to prevent the caching of imports, but that did not work:

# main.py

def run_single_crawl(data, context):
    # Import and create the CrawlerProcess inside the function body,
    # to avoid anything being cached between invocations.
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()

    process.crawl(MySpider)  # MySpider is defined elsewhere in main.py
    process.start()          # still raises ReactorNotRestartable on later invocations

2 Answers


By default, the asynchronous nature of scrapy does not work well with Cloud Functions: we need a way to block on the crawl, otherwise the function returns early and the instance is killed before the crawl finishes.

Instead, we can use scrapydo to run your existing spider in a blocking fashion:

requirements.txt:

scrapydo

main.py:

import scrapy
import scrapydo

scrapydo.setup()  # one-time setup so spiders can be run in a blocking fashion


class MyItem(scrapy.Item):
    url = scrapy.Field()


class MySpider(scrapy.Spider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        yield MyItem(url=response.url)


def run_single_crawl(data, context):
    results = scrapydo.run_spider(MySpider)  # blocks until the crawl finishes; results are the scraped items

This also shows a simple example of how to yield one or more scrapy.Item from the spider and collect the results of the crawl, which would otherwise be challenging to do without scrapydo.
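For example (a minimal sketch, assuming run_spider returns the captured items as a list of MyItem objects, which you could then log from the function):

def run_single_crawl(data, context):
    results = scrapydo.run_spider(MySpider)
    # Log each scraped URL so it shows up in the Cloud Functions logs.
    for item in results:
        print(item["url"])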

Also: make sure that you have billing enabled for your project. Without it, Cloud Functions cannot make outbound requests, and the crawler will appear to succeed but return no results.

Dustin Ingram
  • Hallelujah this seems to be working. I will accept your answer if all goes well with my test and time passed as this question has a bounty, so cannot accept it yet. But wow, I have no idea what this does but what a relief it would work. – WJA Apr 09 '20 at 16:47
  • I really like this solution as well, as I had the exact same issue in Azure Functions. It does work very well, thanks! – Sebastien Mornas Dec 17 '21 at 10:27

You can simply crawl the spiders in sequence.

main.py:

from scrapy.crawler import CrawlerProcess


def run_single_crawl(data, context):
    process = CrawlerProcess()

    # MySpider1 and MySpider2 are assumed to be defined elsewhere in main.py.
    process.crawl(MySpider1)
    process.crawl(MySpider2)

    # start() is called once and blocks until both crawls have finished.
    process.start()