I have deployed a Scrapy project that crawls whenever a Lambda API request comes in.

It runs perfectly for the first API call, but subsequent calls fail and throw a ReactorNotRestartable error.

As far as I can tell, the AWS Lambda ecosystem does not kill the process between invocations, so the Twisted reactor is still present in memory.

The lambda log error is as follows:

Traceback (most recent call last):
  File "/var/task/aws-lambda.py", line 42, in run_company_details_scrapy
    process.start()
  File "./lib/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "./lib/twisted/internet/base.py", line 1242, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "./lib/twisted/internet/base.py", line 1222, in startRunning
    ReactorBase.startRunning(self)
  File "./lib/twisted/internet/base.py", line 730, in startRunning
    raise error.ReactorNotRestartable()
ReactorNotRestartable

The lambda handler function is:

def run_company_details_scrapy(event, context):
    process = CrawlerProcess()
    process.crawl(CompanyDetailsSpidySpider)
    process.start()

I had a workaround of not stopping the reactor, by passing a flag to the start call:

process.start(stop_after_crawl=False)

But the problem with this was that the handler never returns, so I had to wait until the Lambda call timed out.

I tried other solutions, but none of them seems to work. Can anyone guide me on how to solve this problem?

firefox
  • Huh. Lambda re-uses your Python process to handle multiple events? And your handler has to complete synchronously? – Jean-Paul Calderone Feb 22 '17 at 16:37
  • @firefox Since you marked the question as solved, can you describe how you used Crochet to solve your issue? – Hugo Apr 14 '17 at 07:08
  • @firefox I'm having a hard time trying to run Scrapy in an AWS Lambda. How did you create your zip file? I'm getting an `ImportError: cannot import name 'etree'` – Emannuel Carvalho May 31 '17 at 21:35

5 Answers

Had the same problem recently, and Crochet didn't want to work for various reasons.

Eventually we went for the dirty solution: just call sys.exit(0) (or sys.exit(1) if an error was caught, not that anything looks at the return code, AFAICT) at the end of the Lambda handler function. This worked perfectly.

Obviously no good if you're intending to return a response from your Lambda, but if you're using Scrapy, data's probably being persisted already via your Pipelines, with a scheduler as the trigger for your Lambda, so no response needed.
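
A minimal sketch of this approach, reusing the handler and spider names from the question (the exit call is the only addition):

import sys
from scrapy.crawler import CrawlerProcess

def run_company_details_scrapy(event, context):
    process = CrawlerProcess()
    process.crawl(CompanyDetailsSpidySpider)
    process.start()  # blocks until the crawl finishes
    sys.exit(0)      # kill the container's Python process so the next invocation gets a fresh reactor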

Note: you will get a notice from AWS in CloudWatch:

RequestId: xxxx Process exited before completing request 
declension

This problem isn't unique to AWS Lambda - see running a spider in a Celery task.

You might try ScrapyScript (disclosure: I wrote it). It spawns a subprocess to support the Twisted reactor, blocks until all of the supplied spiders have finished, and then exits. It was written with Celery in mind, but the use case is similar.

In your case, this should work:

from scrapyscript import Job, Processor

def run_company_details_scrapy(event, context):
    job = Job(CompanyDetailsSpidySpider())
    Processor().run(job)
jschnurr
  • Careless combination of Twisted and multi-processing results in bizarre hard-to-debug errors. For example - http://stackoverflow.com/questions/42347121/why-i-am-getting-sigchldwaker-object-has-no-attribute-dowrite-in-scrapy - does ScrapyScript address such things? – Jean-Paul Calderone Feb 23 '17 at 13:07
  • @jschnurr I have a different error using Scrapy on Lambda: "module initialization error: 'twisted.internet.reactor'". I tried ScrapyScript without success. Do you have any thoughts on the subject? – Hugo Apr 14 '17 at 08:22

I faced the ReactorNotRestartable error on AWS Lambda, and this is the solution I came to:

By default, the asynchronous nature of Scrapy is not going to work well with serverless functions like Lambda, as we need a way to block on the crawl to prevent the function from returning early and the instance being killed before the process terminates.

Instead, we can use scrapydo to run your existing spider in a blocking fashion:

import scrapy
import scrapy.crawler as crawler
from scrapy.spiders import CrawlSpider
import scrapydo

scrapydo.setup()

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())

scrapydo.run_spider(QuotesSpider)
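
Wrapped in a Lambda handler, a minimal sketch might look like this (the handler name is taken from the question; scrapydo.run_spider blocks until the crawl is done):

def run_company_details_scrapy(event, context):
    # scrapydo.setup() was already called at import time above.
    # run_spider blocks the calling thread until the spider finishes,
    # so the handler does not return before the crawl completes.
    scrapydo.run_spider(QuotesSpider)
    return {'statusCode': 200}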

You could try using https://pypi.python.org/pypi/crochet to coordinate use of a reactor running in a non-main thread from the Lambda-controlled main thread.

Crochet will do the threaded reactor initialization for you and provides tools that make it easy to call code in the reactor thread from the main thread (and get the results back).

This might be more in line with the expectations Lambda has of your code.
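
For example, a minimal sketch of this pattern, assuming the spider and handler names from the question (the helper name and the timeout value are illustrative):

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner

setup()  # start the Twisted reactor in a background (non-main) thread, once per process

@wait_for(timeout=120.0)  # run in the reactor thread; block the caller until the Deferred fires
def _run_crawl():
    runner = CrawlerRunner()
    return runner.crawl(CompanyDetailsSpidySpider)

def run_company_details_scrapy(event, context):
    _run_crawl()  # the reactor stays alive in its thread between invocations, so it never needs restarting
    return {'statusCode': 200}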

Jean-Paul Calderone

Try this! It works for me!

Use Crochet to set up the reactor in a background thread:

import json
import logging
import os
import threading
import boto3
import scrapy
from fake_useragent import UserAgent
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from crochet import setup

# Initialize Crochet
setup()

# Configure logging
logging.getLogger('scrapy').propagate = False
ua = UserAgent()
region = os.getenv('REGION')
sqs = boto3.client('sqs', region_name=region)
queue_links = os.getenv('queue_links')


class MySpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["www.example.com.tw"]
    start_urls = ["https://www.example.com.tw/main/Main.jsp"]
    user_agent = ua.random
    batch_size = 10  # batch size

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.links = []

    def parse(self, response, **kwargs):
        try:
            sub_menus = response.css('.subMenu')
            for sub_menu in sub_menus:
                sub_menu_links = sub_menu.css("#topArea .dul .BTDME a::attr(href)")
                for sub_menu_link in sub_menu_links:
                    link = str(sub_menu_link.get())
                    if link.find("https") != -1 and link.find("category") != -1:
                        try:
                            self.links.append(link)

                        except Exception as e:
                            logging.error(f'SeleniumRequest: error > {e}, link: {sub_menu_link.get()}')
        except Exception as e:
            logging.error(str(e))

class LambdaRunner:
    def __init__(self):
        self.finished = threading.Event()
        self.results = []

    def run_spider(self):
        # Create a CrawlerRunner with project settings
        settings = get_project_settings()
        runner = CrawlerRunner(settings)

        # Create an instance of the spider class
        spider_cls = MySpider

        # Callback function to handle the spider results
        def handle_results(result):
            self.results.append(result)

            # Check if the spider has finished running
            if len(self.results) == 1:
                self.finished.set()

        # Start the first spider run
        deferred = runner.crawl(spider_cls)
        deferred.addCallback(handle_results)

        # join() returns a Deferred that fires once all crawls have finished;
        # the reactor itself is already running in Crochet's background thread
        runner.join()

    def wait_for_completion(self):
        self.finished.wait()

    def get_results(self):
        return self.results


def handler(event, context):
    try:
        runner = LambdaRunner()
        runner.run_spider()
        runner.wait_for_completion()

        return {
            'statusCode': 200,
            'body': json.dumps({'message': 'Completed!'})
        }
    except Exception as e:
        logging.exception(e)
        return {
            'statusCode': 500,
            'body': json.dumps({'error': str(e)})
        }

And you have to remove (or comment out) this setting in your Scrapy project's settings.py:

# TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
Wise Lin