
I built a crawler using the Python Scrapy library. It works perfectly and reliably when run locally. I have attempted to port it over to AWS Lambda (I have packaged it appropriately). However, when I run it, the process isn't blocked while the crawl runs; instead, the function completes before the crawler can return, giving no results. These are the last lines I get in the logs before it exits:

2018-09-12 18:58:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-09-12 18:58:07 [scrapy.core.engine] INFO: Spider opened

Normally I would get a whole lot of information about the pages being crawled. I've tried sleeping after starting the crawl, installing crochet and adding its decorators, and installing and using a framework that sounds like it addresses this problem, but none of it works.

I'm sure this is an issue with Lambda not respecting Scrapy's blocking, but I have no idea how to address it.

2 Answers

I had the same problem and fixed it by creating empty modules for sqlite3, as described in this answer: https://stackoverflow.com/a/44532317/5441099. Apparently, Scrapy imports sqlite3 but doesn't necessarily use it. Python 3 expects sqlite3 to be on the host machine, but the AWS Lambda machines don't have it. The error message doesn't always show up in the logs.

This means you can make it work by switching to Python 2, or by creating empty modules for sqlite3 as I did.

My entry file for running the crawler is as follows; it works on Lambda with Python 3.6:

# run_crawler.py
# crawl() is invoked from the handler function in Lambda
import os
from my_scraper.spiders.my_spider import MySpider
from scrapy.crawler import CrawlerProcess
# Start sqlite3 fix
import imp
import sys
sys.modules["sqlite"] = imp.new_module("sqlite")
sys.modules["sqlite3.dbapi2"] = imp.new_module("sqlite.dbapi2")
# End sqlite3 fix


def crawl():
    process = CrawlerProcess(dict(
        FEED_FORMAT='json',
        FEED_URI='s3://my-bucket/my_scraper_feed/' +
        '%(name)s-%(time)s.json',
        AWS_ACCESS_KEY_ID=os.getenv('AWS_ACCESS_KEY_ID'),
        AWS_SECRET_ACCESS_KEY=os.getenv('AWS_SECRET_ACCESS_KEY'),
    ))
    process.crawl(MySpider)
    process.start()  # the script will block here until all crawling jobs are finished


if __name__ == '__main__':
    crawl()
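
For completeness, here is a minimal sketch of the handler wiring, assuming the file above is saved as run_crawler.py (the module and handler names below are placeholders, not part of the original setup):

# lambda_function.py -- hypothetical handler module
from run_crawler import crawl  # the entry file above

def lambda_handler(event, context):
    # Blocks until process.start() returns, i.e. until the crawl is finished
    crawl()
    return {'status': 'crawl finished'}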
– Viktor Andersen
  • I really want to test this but am running into a number of unrelated issues. I ended up building a container and using Fargate. I'll just trust that this fixes it. – The Empire Strikes Back Nov 23 '18 at 09:46

@viktorAndersen's answer solves the issue of Scrapy crashing or not working as expected in AWS Lambda.

I had a heavy spider crawling 2000 URLs and I faced two problems:

  1. A ReactorNotRestartable error when I ran the Scrapy function more than once. The first invocation worked fine, but from the second invocation onwards I ran into ReactorNotRestartable.

  2. A timeout exception from crochet.wait_for() when the spider takes longer than the expected duration.

This solution is inspired by https://stackoverflow.com/a/57347964/12951298:

import sys
import imp

from scrapy.crawler import CrawlerRunner
from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging

from crochet import setup, wait_for

# Start crochet's reactor thread once, at import time, so the Twisted
# reactor is never restarted between invocations
setup()

# Empty-module fix for sqlite3 (see the answer above)
sys.modules["sqlite"] = imp.new_module("sqlite")
sys.modules["sqlite3.dbapi2"] = imp.new_module("sqlite.dbapi2")


@wait_for(900)
def crawl():
    '''
    wait_for(timeout) takes the timeout in seconds; change it accordingly.
    This function raises crochet.TimeoutError if more than 900 seconds
    elapse before the crawl finishes.
    '''
    spider_name = "header_spider"  # your spider name
    project_settings = get_project_settings()
    spider_loader = SpiderLoader(project_settings)

    spider_cls = spider_loader.load(spider_name)
    configure_logging()
    runner = CrawlerRunner({**project_settings})
    d = runner.crawl(spider_cls)
    return d


if __name__ == "__main__":
    crawl()
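
The Lambda handler wiring is the same as in the sketch in the previous answer: the handler calls crawl(), which blocks until the deferred returned by runner.crawl() fires or the crochet timeout elapses. Since setup() runs at module import time, the crochet-managed reactor thread stays alive across warm invocations of the same container, which is what avoids ReactorNotRestartable on repeated calls. The Lambda function's configured timeout should be at least as long as the wait_for timeout (900 seconds here, which is also Lambda's maximum), otherwise the function is killed before crochet.TimeoutError can be raised.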