5

I'm currently trying to get Scrapy to run in a Google Cloud Function.

from flask import escape
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def hello_http(request):
    settings = get_project_settings()

    process = CrawlerProcess(settings)
    # BlogSpider is the spider shown further down, imported from the project's spiders module
    process.crawl(BlogSpider)
    process.start()

    return 'Hello {}!'.format(escape("Word"))

This works, but strangely enough, not all the time. Every other time, the HTTP call returns an error, and then I can read this in Stackdriver: Function execution took 509 ms, finished with status: 'crash'

I checked the spider and even simplified it to something that can't fail, such as:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        yield { 'id': 1 }

Can someone explain to me what's going on?

Could it be a resource quota I'm hitting?


TKrugg
  • I'm not seeing how your first code block is related to the second code block. Where is `BlogSpider` being used by the Cloud Function? – Dustin Ingram Mar 21 '20 at 16:43
  • yes sorry, it was just a typo. I fixed it. – TKrugg Mar 21 '20 at 22:39
  • 3
    I've been trying to reproduce your issue, and as far as I can see, the first time the function runs everything works as expected. After executing the function one more time, I see the following error: twisted.internet.error.ReactorNotRestartable. As far as I can tell, this fails because there is only one Twisted reactor per process and it cannot be started twice. You could try to implement something like [this](https://botproxy.net/docs/how-to/scrapy-how-to-run-spider-from-other-python-script-twice-or-more/). I hope it helps – Christopher Rodriguez Conde Mar 23 '20 at 10:16
  • 1
    I also found these Stack Overflow posts that may help: [post-1](https://stackoverflow.com/questions/48913525/scrapy-raises-reactornotrestartable-when-crawlerprocess-is-ran-twice) and [post-2](https://stackoverflow.com/questions/39946632/reactornotrestartable-error-in-while-loop-with-scrapy). According to the [documentation](https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process), Scrapy runs a single spider per process when you run scrapy crawl; however, Scrapy supports running multiple spiders per process using the internal API. – Christopher Rodriguez Conde Mar 23 '20 at 10:18
  • Did anyone find a solution to this? Related question: https://stackoverflow.com/questions/61083449/reactornotrestartable-with-scrapy-when-using-google-cloud-functions – WJA Apr 09 '20 at 15:58
  • 1
    I ended up wrapping my spider in its own process. @ChristopherRodriguezConde's solution put me on the right track. I wrote about it here: https://weautomate.org/articles/running-scrapy-spider-cloud-function/ – TKrugg Apr 10 '20 at 07:40

2 Answers

4

This is because your Cloud Function is not idempotent. Under the hood, Scrapy uses Twisted, which sets up many global objects to power its event-driven machinery.

https://weautomate.org/articles/running-scrapy-spider-cloud-function/
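
For reference, a minimal sketch of that workaround, assuming the setup from the question: the crawl runs in a child process so that each invocation gets a fresh Twisted reactor. The helper name run_spider is illustrative, and BlogSpider is assumed to be importable in main.py.

    from multiprocessing import Process

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def run_spider():
        # Runs in a child process, so the Twisted reactor is created,
        # started and stopped entirely inside that process.
        process = CrawlerProcess(get_project_settings())
        process.crawl(BlogSpider)
        process.start()  # blocks until the crawl finishes

    def hello_http(request):
        # Each HTTP invocation spawns its own process, sidestepping
        # twisted.internet.error.ReactorNotRestartable on warm instances.
        p = Process(target=run_spider)
        p.start()
        p.join()  # wait for the crawl before returning the HTTP response
        return 'OK'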

0

The code seems fine. One probable problem is that your settings file lives in a subdirectory relative to your main.py file. Load it with the proper module path:

    import os

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def hello_http(request):
        # Point Scrapy at the settings module that lives in the subdirectory
        settings_file_path = '<main_folder>.<sub_folder>.settings'
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        settings = get_project_settings()
        settings.setdict({
            'LOG_LEVEL': 'ERROR',
            'LOG_ENABLED': True,
        })
        process = CrawlerProcess(settings)
        # BlogSpider must also be importable from main.py
        process.crawl(BlogSpider)
        process.start()
        return 'OK'

This code seems to work with my Google Cloud Function.
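
For context, this is roughly the project layout the snippet assumes; the folder names are placeholders matching the dotted path above, and the spider module name is illustrative:

    main.py
    <main_folder>/
        __init__.py
        <sub_folder>/
            __init__.py
            settings.py
            spiders/
                __init__.py
                blogspider.py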