I've got a spider which inherits from `SitemapSpider`. As expected, the first request on startup is to the sitemap.xml of my website. However, for it to work correctly I need to add a header to all the requests, including the initial ones which fetch the sitemap. I do so with a `DownloaderMiddleware`, like this:
```python
def process_request(self, request: scrapy.http.Request, spider):
    # If the header is already set, let the request pass through untouched
    if "Host" in request.headers:
        return None
    # Take the hostname from the request URL and set it as the Host header
    host = request.url.removeprefix("https://").removeprefix("http://").split("/")[0]
    request.headers["Host"] = host
    spider.logger.info(f"Got {request}")
    return request
```
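For completeness, the middleware is enabled in the project settings roughly like this (the dotted path and the priority value are placeholders for my actual ones):

```python
# settings.py: register the downloader middleware.
# "myproject.middlewares.HostHeaderMiddleware" and 543 are placeholders.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.HostHeaderMiddleware": 543,
}
```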
However, it looks like Scrapy's request deduplication is stopping this request from going through. In my logs I see something like this:

```
2021-10-16 21:21:08 [ficbook-spider] INFO: Got <GET https://mywebsite.com/sitemap.xml>
2021-10-16 21:21:08 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://mywebsite.com/sitemap.xml>
```
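(If it matters, I've only run with the default dupefilter settings so far; I believe I could make the dupefilter log every filtered request, rather than only the first one, with Scrapy's DUPEFILTER_DEBUG setting, but I haven't tried that yet:)

```python
# settings.py: log every request the dupefilter drops, not only the first one
DUPEFILTER_DEBUG = True
```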
Since `spider.logger.info` in `process_request` is triggered only once, I presume that this is the first request and that, after processing, it gets deduplicated. I thought that maybe deduplication is triggered before the `DownloaderMiddleware` (that would explain why the request is filtered without a second "Got ..." line in the logs), but I don't think that's true, for two reasons:
- I looked through the code of `SitemapSpider`, and it appears to fetch sitemap.xml only once (see the snippet after this list)
- If it had, in fact, fetched it before, I'd expect it to do something with the response; instead the spider just stops, since no pages were enqueued for processing
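For reference, the part of `SitemapSpider` I'm referring to looks roughly like this (paraphrased from the Scrapy source from memory, so the exact details may differ); each URL in sitemap_urls is requested exactly once from start_requests:

```python
# Paraphrased sketch of scrapy.spiders.SitemapSpider.start_requests:
# one request per URL listed in sitemap_urls.
def start_requests(self):
    for url in self.sitemap_urls:
        yield Request(url, callback=self._parse_sitemap)
```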
Why does this happen? Did I make some mistake in `process_request`?