This is less of a "how do I use these?" question and more of a "when/why would I use these?" question.
EDIT: This question is a near duplicate of this question, which suggests the use of a Downloader Middleware to filter such requests. I've updated my question below to reflect that.
In the Scrapy CrawlSpider documentation, rules accept two callables, process_links and process_request (documentation quoted below for easier reference).
By default Scrapy filters duplicate URLs, but I'm looking to do additional filtering of requests because I get duplicates of pages that have multiple distinct URLs linking to them. For example:
URL1 = "http://example.com/somePage.php?id=XYZ&otherParam=fluffyKittens"
URL2 = "http://example.com/somePage.php?id=XYZ&otherParam=scruffyPuppies"
However, these URLs will share a common element in the query string - in the example above, it is the id parameter.
I'm thinking it would make sense to use the process_links callable of my spider to filter out duplicate requests.
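Here is a rough sketch of what I have in mind, just to make the idea concrete (the spider name, allowed domain, allow pattern, and the seen_ids attribute are placeholders I made up, not code from my actual project):

```python
from urllib.parse import urlparse, parse_qs

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = "example"                       # placeholder name
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    rules = (
        Rule(
            LinkExtractor(allow=r"somePage\.php"),
            callback="parse_item",
            process_links="dedupe_links",  # filter links before requests are built
            follow=True,
        ),
    )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_ids = set()              # ids we've already queued a request for

    def dedupe_links(self, links):
        # Keep only the first link seen for each value of the id query parameter.
        unique = []
        for link in links:
            page_id = parse_qs(urlparse(link.url).query).get("id", [None])[0]
            if page_id is None:
                unique.append(link)        # no id at all - let it through
            elif page_id not in self.seen_ids:
                self.seen_ids.add(page_id)
                unique.append(link)
        return unique

    def parse_item(self, response):
        pass                               # actual parsing omitted
```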
Questions:
- Is there some reason why process_request would be better suited to this task?
- If not, can you provide an example of when process_request would be more applicable?
- Is a downloader middleware more appropriate than either process_links or process_request? If so, can you provide an example of when process_links or process_request would be a better solution? (A rough sketch of the middleware approach is right after this list.)
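For reference, my understanding of the downloader-middleware approach suggested in the linked question is something like the following (the class name, module path, and priority number are placeholders I picked):

```python
from urllib.parse import urlparse, parse_qs

from scrapy.exceptions import IgnoreRequest


class DuplicateIdFilterMiddleware:
    """Drop any request whose 'id' query parameter has already been seen."""

    def __init__(self):
        self.seen_ids = set()

    def process_request(self, request, spider):
        page_id = parse_qs(urlparse(request.url).query).get("id", [None])[0]
        if page_id is not None:
            if page_id in self.seen_ids:
                raise IgnoreRequest(f"duplicate id: {page_id}")
            self.seen_ids.add(page_id)
        return None   # None means: continue processing this request normally
```

It would then be enabled in settings.py with something like (path and priority are placeholders):

```python
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.DuplicateIdFilterMiddleware": 543,
}
```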
Documentation quote:
process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.
process_request is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
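For completeness, my reading of the quoted documentation is that the equivalent process_request hook would be a callable (or spider method referenced by name) that returns the request to keep it or None to drop it, roughly like this (the module-level seen_ids set is purely for illustration; newer Scrapy versions may also pass the originating response to this callable):

```python
from urllib.parse import urlparse, parse_qs

seen_ids = set()   # module-level state, purely for illustration


def dedupe_requests(request):
    # Per the docs: return the request to keep it, or None to filter it out.
    page_id = parse_qs(urlparse(request.url).query).get("id", [None])[0]
    if page_id is not None and page_id in seen_ids:
        return None
    if page_id is not None:
        seen_ids.add(page_id)
    return request
```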