
I deployed my web crawler to AWS Lambda. While testing, it ran correctly the first time, but the second invocation raised twisted.internet.error.ReactorNotRestartable:

File "/var/task/main.py", line 19, in run_spider
    reactor.run()
  File "/var/task/twisted/internet/base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/var/task/twisted/internet/base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "/var/task/twisted/internet/base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

The crawler worked fine in my local Python environment. The function I am trying to run inside main.py is this:

def run_spider(event, s):
    given_links = []
    print(given_links)
    for t in event["Records"]:
        given_links.append(t["body"])
    runner = CrawlerRunner(s)
    deferred = runner.crawl('spider', crawl_links=given_links)
    deferred.addCallback(lambda _: reactor.stop())
    reactor.run()

def lambda_handler(event, context=None):
    s = get_project_settings()
    s['FEED_FORMAT'] = 'csv'
    s['FEED_URI'] = '/tmp/output.csv'
    run_spider(event, s)

where the event looks like this:

{
  "Records": [
    {
      "body": "https://example.com"
    }
  ]
}
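For reference, the link-extraction part of run_spider can be checked on its own against that event shape:

```python
# Same extraction as the loop in run_spider, using the sample event above.
event = {"Records": [{"body": "https://example.com"}]}

given_links = [record["body"] for record in event["Records"]]
print(given_links)  # ['https://example.com']
```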

Initially I was using CrawlerProcess instead of CrawlerRunner, but it gave the same error. After looking through some answers on Stack Overflow, I changed my code to use CrawlerRunner. Some people also suggested Crochet; when I tried that, I got this error:

ValueError: signal only works in main thread

What can I do to resolve this error?
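One pattern I have seen suggested for this error is to launch each crawl in a fresh child process, so the Twisted reactor always starts in a clean interpreter. A minimal stdlib sketch of the idea, where crawl is only a stand-in for my real Scrapy call (the names are illustrative, not from my code):

```python
import multiprocessing

def crawl(url, conn):
    # Stand-in for the real Scrapy crawl; the real version would start
    # a CrawlerProcess here and send scraped results back over the pipe.
    conn.send("crawled " + url)
    conn.close()

def run_in_subprocess(url):
    # Each call gets a fresh interpreter, so Twisted's reactor is
    # started at most once per process and never has to restart.
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=crawl, args=(url, child_conn))
    p.start()
    result = parent_conn.recv()
    p.join()
    return result

if __name__ == "__main__":
    print(run_in_subprocess("https://example.com"))
```

Note that multiprocessing.Queue and Pool rely on /dev/shm, which AWS Lambda does not provide, which is why this sketch uses Pipe. Would something along these lines be considered cleaner than sys.exit()?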

  • May this be a duplicate of https://stackoverflow.com/q/42388541/939364 ? – Gallaecio May 19 '20 at 15:13
  • Maybe, you are right, but none of the solutions are working for me on that link. The sys.exit() answer is working finally but I wanted something less dirty. So, I thought it would be better to ask the question again? I am sorry if it's wrong, I am just new to this StackOverflow stuff. – vaibhav mittal May 20 '20 at 06:34
  • I would personally just vote up the original question and share it in [Reddit](https://www.reddit.com/r/scrapy/), asking if anyone can think of a cleaner method. – Gallaecio May 22 '20 at 10:16

1 Answer


I faced the ReactorNotRestartable error on AWS Lambda as well, and eventually came to this solution.

By default, the asynchronous nature of Scrapy does not play well with cloud functions: we need a way to block on the crawl, otherwise the function returns early and the instance is killed before the process terminates.

Instead, we can use scrapydo to run your existing spider in a blocking fashion:

import scrapy
import scrapydo

scrapydo.setup()

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())

scrapydo.run_spider(QuotesSpider)