I'm trying to run Scrapy in an AWS Lambda function and everything is almost working, except that I need to run 2 spiders inside the one Lambda function. The main catch is that the 2 spiders need to output to 2 different JSON files.
The docs look like they've got a very close solution:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
Except that if I pass my settings into the CrawlerProcess like I currently do:
CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': '/tmp/fx_today_data.json'
})
then both spiders would output to the one file, fx_today_data.json.
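Putting the two pieces together, my current setup is roughly this (the spider bodies are trimmed and the spider names here are just placeholders for my real ones):

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    name = 'spider1'
    # ... rest of the first spider definition ...

class MySpider2(scrapy.Spider):
    name = 'spider2'
    # ... rest of the second spider definition ...

process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': '/tmp/fx_today_data.json'  # both spiders end up writing their items here
})
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()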
I've tried creating 2 CrawlerProcess instances, but that gives me the ReactorNotRestartable error, which I've tried solving using this thread, but with no success.
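Using the same two spider classes as above, that failing attempt looked roughly like this (the second output path is just a placeholder name):

process1 = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': '/tmp/fx_today_data.json'
})
process1.crawl(MySpider1)
process1.start()  # starts the Twisted reactor and stops it once the crawl finishes

process2 = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': '/tmp/fx_other_data.json'  # placeholder for the second spider's output file
})
process2.crawl(MySpider2)
process2.start()  # raises ReactorNotRestartable - the reactor can't be started a second time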
I've also tried running the scrapy code like so:
subprocess.call(["scrapy", "runspider", "./spiders/fx_today_data.py", "-o", "/tmp/fx_today_data.json"])
But this results in the usual 'scrapy' command not found error, because I don't have a virtualenv set up in the Lambda function (I don't know if it's worth setting one up just for this?).
Does anyone know how to run 2 Scrapy Spiders (they don't have to run at the same time) in one process and have them output to separate files?