I'm trying to run Scrapy in an AWS Lambda function and everything is almost working, except that I need to run 2 spiders inside the one Lambda function. The main catch is that the 2 spiders need to output to 2 different JSON files.
The docs look like they've got a very close solution:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
Except that if I pass my settings into the CrawlerProcess like I currently do:
CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': '/tmp/fx_today_data.json'
})
then both spiders would output to the one file, fx_today_data.json.
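Putting the two pieces together, my current setup is roughly this (the spider bodies are trimmed and the spider names here are just placeholders for my real ones):

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    name = 'spider1'
    # ... rest of the first spider definition ...

class MySpider2(scrapy.Spider):
    name = 'spider2'
    # ... rest of the second spider definition ...

process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': '/tmp/fx_today_data.json'  # both spiders end up writing their items here
})
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()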
I've tried creating 2 CrawlerProcess instances, but that gives me the ReactorNotRestartable error, which I've tried solving using this thread, but with no success.
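Using the same two spider classes as above, that failing attempt looked roughly like this (the second output path is just a placeholder name):

process1 = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': '/tmp/fx_today_data.json'
})
process1.crawl(MySpider1)
process1.start()  # starts the Twisted reactor and stops it once the crawl finishes

process2 = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': '/tmp/fx_other_data.json'  # placeholder for the second spider's output file
})
process2.crawl(MySpider2)
process2.start()  # raises ReactorNotRestartable - the reactor can't be started a second time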
I've also tried running the scrapy code like so:
subprocess.call(["scrapy", "runspider", "./spiders/fx_today_data.py", "-o", "/tmp/fx_today_data.json"])
But this results in the usual 'scrapy' command not found error, because I don't have a virtualenv set up in the Lambda function (I don't know if it's worth setting one up just for this?).
Does anyone know how to run 2 Scrapy Spiders (they don't have to run at the same time) in one process and have them output to separate files?