I need to run parallel spiders, one spider per domain that I am crawling, so I am passing in a single start_url to each process using the approach described here: https://stackoverflow.com/questions/57638131/how-to-run-scrapy-crawler-process-parallel-in-separate-processes-multiprocessi

However, as each CrawlSpider needs to stay on its own domain, I need to use 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware' to prevent off-site crawling. That middleware uses self.allowed_domains to decide what counts as 'local', but I can't set that attribute until my start_requests method runs, as it's only then that the start_url and its associated allowed_domains value are available in the passed-in settings. Here's my start_requests method:

    def start_requests(self):
        # allowed_domains must be a list of domain strings, not a bare string
        self.allowed_domains = [self.settings['crawl_start'][1]]
        yield scrapy.Request(self.settings['crawl_start'][0])

Unfortunately, the middleware seems to ignore allowed_domains when it is set at this point, presumably because the middleware has already compiled its host regex by the time start_requests runs.

I've tried many approaches, but I seem to have a race condition: the passed-in start_url/allowed_domains values aren't yet available at the point where they need to be set, which is earlier in the pipeline, before start_requests runs, so the offsite middleware never picks them up.
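To illustrate the timing issue, here is a stdlib-only sketch that mimics how the offsite middleware compiles allowed_domains into a host regex once, up front (the names build_host_regex and is_onsite are my own; this is not Scrapy's actual code, just the same pattern):

```python
import re
from urllib.parse import urlparse

def build_host_regex(allowed_domains):
    """Mimic the middleware's get_host_regex: accept the allowed
    domains and any of their subdomains; no domains means allow all."""
    if not allowed_domains:
        return re.compile("")  # matches every hostname
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(rf"^(.*\.)?({domains})$", re.IGNORECASE)

def is_onsite(url, host_regex):
    host = urlparse(url).hostname or ""
    return bool(host_regex.match(host))

# Regex compiled at middleware init, before the spider sets allowed_domains:
stale_regex = build_host_regex([])  # spider hadn't set them yet
print(is_onsite("https://evil.example.org/x", stale_regex))  # True: nothing filtered

# Regex rebuilt after start_requests has set allowed_domains:
fresh_regex = build_host_regex(["example.com"])
print(is_onsite("https://sub.example.com/x", fresh_regex))   # True
print(is_onsite("https://evil.example.org/x", fresh_regex))  # False
```

The sketch shows why setting allowed_domains inside start_requests has no effect unless the regex is rebuilt afterwards.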

Help would be appreciated. Thanks.

1 Answer


I fixed this by overriding the _filter method of the standard offsite middleware ('scrapy.spidermiddlewares.offsite.OffsiteMiddleware') so that it rebuilds the host regex on every call via get_host_regex, which picks up the latest 'allowed_domains' value:

    import logging

    from scrapy import Request
    from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
    from scrapy.utils.httpobj import urlparse_cached

    logger = logging.getLogger(__name__)

    class DynamicOffsiteMiddleware(OffsiteMiddleware):
        def _filter(self, request, spider) -> bool:
            if not isinstance(request, Request):
                return True
            # NOTE: rebuild the host regex on every check so that an
            # 'allowed_domains' set late (e.g. in start_requests) is picked up
            self.host_regex = self.get_host_regex(spider)
            if request.dont_filter or self.should_follow(request, spider):
                return True
            domain = urlparse_cached(request).hostname
            if domain and domain not in self.domains_seen:
                self.domains_seen.add(domain)
                logger.debug(
                    "Filtered offsite request to %(domain)r: %(request)s",
                    {"domain": domain, "request": request},
                    extra={"spider": spider},
                )
                self.stats.inc_value("offsite/domains", spider=spider)
            self.stats.inc_value("offsite/filtered", spider=spider)
            return False
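To activate the override, disable the built-in offsite middleware and register the custom one in SPIDER_MIDDLEWARES. The module path 'myproject.middlewares' and the class name DynamicOffsiteMiddleware are only examples; use whatever you named your subclass:

```python
# settings.py
SPIDER_MIDDLEWARES = {
    # disable the stock offsite middleware...
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    # ...and enable the subclass with the overridden _filter,
    # at the same priority slot (500) the stock one occupies
    "myproject.middlewares.DynamicOffsiteMiddleware": 500,
}
```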