This is less of a "how do I use these?" question and more of a "when/why would I use these?" question.
EDIT: This question is a near duplicate of this question, which suggests the use of a Downloader Middleware to filter such requests. I've updated my question below to reflect that.
In the Scrapy CrawlSpider documentation, rules accept two callables, process_links and process_request (documentation quoted below for easier reference).
By default Scrapy filters duplicate URLs, but I'm looking to do additional filtering of requests because I get duplicates of pages that have multiple distinct URLs linking to them. For example:
URL1 = "http://example.com/somePage.php?id=XYZ&otherParam=fluffyKittens"
URL2 = "http://example.com/somePage.php?id=XYZ&otherParam=scruffyPuppies"
However, these URLs will share a common element in the query string - in the example above, it is the id parameter.
I'm thinking it would make sense to use the process_links callable of my spider to filter out duplicate requests.
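Here is a rough sketch of what I have in mind, just to make the idea concrete (the spider name, allowed domain, allow pattern, and the seen_ids attribute are placeholders I made up, not code from my actual project):

```python
from urllib.parse import urlparse, parse_qs

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = "example"                       # placeholder name
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    rules = (
        Rule(
            LinkExtractor(allow=r"somePage\.php"),
            callback="parse_item",
            process_links="dedupe_links",  # filter links before requests are built
            follow=True,
        ),
    )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_ids = set()              # ids we've already queued a request for

    def dedupe_links(self, links):
        # Keep only the first link seen for each value of the id query parameter.
        unique = []
        for link in links:
            page_id = parse_qs(urlparse(link.url).query).get("id", [None])[0]
            if page_id is None:
                unique.append(link)        # no id at all - let it through
            elif page_id not in self.seen_ids:
                self.seen_ids.add(page_id)
                unique.append(link)
        return unique

    def parse_item(self, response):
        pass                               # actual parsing omitted
```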
Questions:
- Is there some reason why process_request would be better suited to this task?
- If not, can you provide an example of when process_request would be more applicable?
- Is a downloader middleware more appropriate than either process_links or process_request? If so, can you provide an example of when process_links or process_request would be a better solution? (A rough sketch of the middleware approach is right after this list.)
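For reference, my understanding of the downloader-middleware approach suggested in the linked question is something like the following (the class name, module path, and priority number are placeholders I picked):

```python
from urllib.parse import urlparse, parse_qs

from scrapy.exceptions import IgnoreRequest


class DuplicateIdFilterMiddleware:
    """Drop any request whose 'id' query parameter has already been seen."""

    def __init__(self):
        self.seen_ids = set()

    def process_request(self, request, spider):
        page_id = parse_qs(urlparse(request.url).query).get("id", [None])[0]
        if page_id is not None:
            if page_id in self.seen_ids:
                raise IgnoreRequest(f"duplicate id: {page_id}")
            self.seen_ids.add(page_id)
        return None   # None means: continue processing this request normally
```

It would then be enabled in settings.py with something like (path and priority are placeholders):

```python
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.DuplicateIdFilterMiddleware": 543,
}
```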
Documentation quote:
process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.
process_request is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
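For completeness, my reading of the quoted documentation is that the equivalent process_request hook would be a callable (or spider method referenced by name) that returns the request to keep it or None to drop it, roughly like this (the module-level seen_ids set is purely for illustration; newer Scrapy versions may also pass the originating response to this callable):

```python
from urllib.parse import urlparse, parse_qs

seen_ids = set()   # module-level state, purely for illustration


def dedupe_requests(request):
    # Per the docs: return the request to keep it, or None to filter it out.
    page_id = parse_qs(urlparse(request.url).query).get("id", [None])[0]
    if page_id is not None and page_id in seen_ids:
        return None
    if page_id is not None:
        seen_ids.add(page_id)
    return request
```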