5

I'm currently trying to get Scrapy to run in a Google Cloud Function.

from flask import escape
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def hello_http(request):
    settings = get_project_settings()

    process = CrawlerProcess(settings)
    # BlogSpider is the spider shown further down, imported from the project's spiders module
    process.crawl(BlogSpider)
    process.start()

    return 'Hello {}!'.format(escape("Word"))

This works, but strangely enough, not all the time. Every other time, the HTTP call returns an error, and then I can read this in Stackdriver: Function execution took 509 ms, finished with status: 'crash'

I checked the spider and even simplified it to something that can't fail, such as:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        yield { 'id': 1 }

Can someone explain to me what's going on?

Could it be a resource quota I'm hitting?


TKrugg
  • I'm not seeing how your first code block is related to the second code block. Where is `BlogSpider` being used by the Cloud Function? – Dustin Ingram Mar 21 '20 at 16:43
  • yes sorry, it was just a typo. I fixed it. – TKrugg Mar 21 '20 at 22:39
  • 3
    I've been trying to reproduce your issue, and as far as I can see, the first time the function runs everything works as expected. After executing the function one more time, I see the following error: twisted.internet.error.ReactorNotRestartable. As far as I can tell, this fails because there is only one Twisted reactor per process and it cannot be started twice. You could try to implement something like [this](https://botproxy.net/docs/how-to/scrapy-how-to-run-spider-from-other-python-script-twice-or-more/). I hope it helps – Christopher Rodriguez Conde Mar 23 '20 at 10:16
  • 1
    I also found these Stack Overflow posts that may help: [post-1](https://stackoverflow.com/questions/48913525/scrapy-raises-reactornotrestartable-when-crawlerprocess-is-ran-twice) and [post-2](https://stackoverflow.com/questions/39946632/reactornotrestartable-error-in-while-loop-with-scrapy). According to the [documentation](https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process), Scrapy runs a single spider per process when you run scrapy crawl; however, Scrapy supports running multiple spiders per process using the internal API. – Christopher Rodriguez Conde Mar 23 '20 at 10:18
  • Did anyone find a solution to this? Related question: https://stackoverflow.com/questions/61083449/reactornotrestartable-with-scrapy-when-using-google-cloud-functions – WJA Apr 09 '20 at 15:58
  • 1
    I ended up wrapping my spider in its own process. @ChristopherRodriguezConde's solution put me on the right track. I wrote about it here: https://weautomate.org/articles/running-scrapy-spider-cloud-function/ – TKrugg Apr 10 '20 at 07:40

2 Answers

4

This is because your Cloud Function is not idempotent. Under the hood, Scrapy uses Twisted, which sets up many global objects to power its event-driven machinery.

https://weautomate.org/articles/running-scrapy-spider-cloud-function/
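
For reference, a minimal sketch of that workaround, assuming the setup from the question: the crawl runs in a child process so that each invocation gets a fresh Twisted reactor. The helper name run_spider is illustrative, and BlogSpider is assumed to be importable in main.py.

    from multiprocessing import Process

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def run_spider():
        # Runs in a child process, so the Twisted reactor is created,
        # started and stopped entirely inside that process.
        process = CrawlerProcess(get_project_settings())
        process.crawl(BlogSpider)
        process.start()  # blocks until the crawl finishes

    def hello_http(request):
        # Each HTTP invocation spawns its own process, sidestepping
        # twisted.internet.error.ReactorNotRestartable on warm instances.
        p = Process(target=run_spider)
        p.start()
        p.join()  # wait for the crawl before returning the HTTP response
        return 'OK'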

0

The code seems fine. One probable problem is that your settings file lives in a subdirectory relative to your main.py file. Load it with the proper module path:

    import os

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def hello_http(request):
        # Point Scrapy at the settings module that lives in the subdirectory
        settings_file_path = '<main_folder>.<sub_folder>.settings'
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        settings = get_project_settings()
        settings.setdict({
            'LOG_LEVEL': 'ERROR',
            'LOG_ENABLED': True,
        })
        process = CrawlerProcess(settings)
        # BlogSpider must also be importable from main.py
        process.crawl(BlogSpider)
        process.start()
        return 'OK'

This code seems to work with my Google Cloud Function.
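
For context, this is roughly the project layout the snippet assumes; the folder names are placeholders matching the dotted path above, and the spider module name is illustrative:

    main.py
    <main_folder>/
        __init__.py
        <sub_folder>/
            __init__.py
            settings.py
            spiders/
                __init__.py
                blogspider.py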