I need to run parallel spiders, one spider per domain that I am crawling, so I am passing a single start_url into each process using the approach described here: https://stackoverflow.com/questions/57638131/how-to-run-scrapy-crawler-process-parallel-in-separate-processes-multiprocessi
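For context, each process is launched roughly like this (the spider name, URLs, and the layout of my 'crawl_start' custom setting are placeholders for my real values):

# Rough sketch of my launcher: one process per domain.
# 'crawl_start' is my own custom setting: (start_url, allowed domains for that site).
from multiprocessing import Process

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def run_spider(start_url, domain):
    settings = get_project_settings()
    settings.set('crawl_start', (start_url, [domain]))  # allowed_domains expects a list
    process = CrawlerProcess(settings)
    process.crawl('mysite_spider')  # placeholder spider name
    process.start()

if __name__ == '__main__':
    targets = [('https://example.com/', 'example.com'),
               ('https://example.org/', 'example.org')]
    procs = [Process(target=run_spider, args=t) for t in targets]
    for p in procs:
        p.start()
    for p in procs:
        p.join()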
However, as each CrawlSpider needs to stay on its own domain, I need 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware' to prevent off-site crawling. That middleware appears to use self.allowed_domains to decide what counts as 'local', but I can't set that attribute until my start_requests method runs, as it's only then that the start_url and its associated allowed domain are available from the passed-in settings. Here's my start_requests method:
def start_requests(self):
    # 'crawl_start' is my custom setting: (start_url, allowed domain(s)) for this process
    self.allowed_domains = self.settings['crawl_start'][1]
    yield scrapy.Request(self.settings['crawl_start'][0])
Unfortunately, the middleware seems to ignore allowed_domains when it's assigned at this point, presumably because it's too late in the pipeline.
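As far as I can tell from the Scrapy source, the middleware builds its host regex from allowed_domains once, when the spider_opened signal fires, which happens before start_requests is consumed; roughly:

# Paraphrased from scrapy/spidermiddlewares/offsite.py (details may vary by version)
class OffsiteMiddleware:
    def spider_opened(self, spider):
        # The regex is built from whatever spider.allowed_domains holds at open time,
        # so a value assigned later in start_requests() is never picked up.
        self.host_regex = self.get_host_regex(spider)
        self.domains_seen = set()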
I've tried many approaches, but I seem to have a race condition: the passed-in start_url/allowed_domains values aren't yet available at the point where allowed_domains needs to be set, which appears to be earlier in the pipeline, before start_requests runs, for the OffsiteMiddleware to pick it up.
Help would be appreciated. Thanks.