43

I am writing a crawler for a website using scrapy with CrawlSpider.

Scrapy provides a built-in duplicate-request filter which filters duplicate requests based on URLs. I can also filter requests using the rules member of CrawlSpider.

What I want to do is to filter requests like:

http://www.abc.com/p/xyz.html?id=1234&refer=5678

If I have already visited

http://www.abc.com/p/xyz.html?id=1234&refer=4567

NOTE: refer is a parameter that doesn't affect the response I get, so I don't care if the value of that parameter changes.

Now, if I kept a set which accumulates all the ids, I could ignore duplicates in my callback function parse_item to achieve this functionality.
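To make this concrete, the workaround I have in mind would look roughly like this (the spider skeleton and the id parsing are simplified placeholders):

from urlparse import urlparse, parse_qs  # Python 2 stdlib
from scrapy.contrib.spiders import CrawlSpider


class MySpider(CrawlSpider):
    # ... name, allowed_domains, rules, etc. ...

    def parse_item(self, response):
        # Pull the id out of the query string and skip ids already handled.
        item_id = parse_qs(urlparse(response.url).query).get('id', [None])[0]
        if not hasattr(self, 'ids_seen'):
            self.ids_seen = set()
        if item_id in self.ids_seen:
            return  # duplicate id -- but the page has already been downloaded
        self.ids_seen.add(item_id)
        # ... actual item extraction goes here ...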

But that would mean I am still at least fetching that page, when I don't need to.

So what is the way in which I can tell scrapy that it shouldn't send a particular request based on the url?

– nik-v

5 Answers

42

You can write a custom filter for duplicate removal and add it in settings:

import os

from scrapy.dupefilter import RFPDupeFilter

class CustomFilter(RFPDupeFilter):
    """A dupe filter that considers specific ids in the url"""

    def __getid(self, url):
        # Keep only the part of the URL before "&refer" (or whatever
        # identifies the page for you) and use it as the fingerprint.
        mm = url.split("&refer")[0]
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

Then you need to set the correct DUPEFILTER_CLASS in settings.py:

DUPEFILTER_CLASS = 'scraper.duplicate_filter.CustomFilter'

It should work after that
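
Note: on Scrapy 1.0 and later the dupe filter module was renamed, so only the import line needs to change; the filter itself stays the same:

from scrapy.dupefilters import RFPDupeFilter  # Scrapy >= 1.0; scrapy.dupefilter is the 0.x path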

– ytomar
  • I put your code in a file in the spider folder, but I got this error: `dupefilter = dupefilter_cls.from_settings(settings) exceptions.AttributeError: 'module' object has no attribute 'from_settings'` – William Kinaan Jan 18 '14 at 14:52
  • Thanks, this works, but how do I access the `spider` object from my custom filter class? – wolfgang Sep 09 '15 at 13:12
10

Following ytomar's lead, I wrote this filter that filters based purely on URLs that have already been seen by checking an in-memory set. I'm a Python noob so let me know if I screwed something up, but it seems to work all right:

from scrapy.dupefilter import RFPDupeFilter

class SeenURLFilter(RFPDupeFilter):
    """A dupe filter that considers the URL"""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        if request.url in self.urls_seen:
            return True
        else:
            self.urls_seen.add(request.url)

As ytomar mentioned, be sure to add the DUPEFILTER_CLASS constant to settings.py:

DUPEFILTER_CLASS = 'scraper.custom_filters.SeenURLFilter'
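
If you only want to ignore a particular query parameter (like refer in the question) rather than deduplicate on the full URL, the same idea could normalize the URL before checking it. A sketch using w3lib's url_query_cleaner (w3lib ships with Scrapy; the class below is only an illustration, and you should verify the remove=True argument against your w3lib version):

from w3lib.url import url_query_cleaner
from scrapy.dupefilter import RFPDupeFilter


class IgnoreReferFilter(RFPDupeFilter):
    """A dupe filter that ignores the 'refer' query parameter (sketch)."""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        # Strip the 'refer' parameter so URLs differing only in it collide.
        url = url_query_cleaner(request.url, ('refer',), remove=True)
        if url in self.urls_seen:
            return True
        self.urls_seen.add(url)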
– Abe Voelker
  • Where should I put the file please? – William Kinaan Jan 26 '14 at 21:06
  • @WilliamKinaan `custom_filters.py` is where I put it, in the same directory as `settings.py`. However I ended up just using scrapy's default URL filter as it was good enough for me anyway. This was more of an exercise in learning how to write a custom filter. I haven't looked at the internal implementation, but have heard it uses a [bloom filter](http://en.wikipedia.org/wiki/Bloom_filter) which gives higher lookup performance (at the cost of potentially re-visiting *some* URLs). – Abe Voelker Jan 27 '14 at 20:02
  • Thanks for your comment. Also, what is `scrapy's default URL filter`? And could you point me to official documentation for it? Thanks in advance – William Kinaan Jan 28 '14 at 15:57
  • @WilliamKinaan The default filter is class `RFPDupeFilter`, source here: https://github.com/scrapy/scrapy/blob/af16fa326feb1058153c06490c0dc931c240f57d/scrapy/dupefilter.py#L28 As for documentation, I doubt there is any on this specific class. Perhaps post your question on the scrapy mailing list: https://groups.google.com/forum/#!forum/scrapy-users – Abe Voelker Jan 28 '14 at 16:25
  • Thanks for your comment. So I can either create a class that inherits from `RFPDupeFilter` like the above answer, or just set the `DUPEFILTER_CLASS` variable in settings to `RFPDupeFilter`, right? – William Kinaan Jan 28 '14 at 19:01
3

https://github.com/scrapinghub/scrapylib/blob/master/scrapylib/deltafetch.py

This file might help you. It builds a database of unique deltafetch keys from the URLs; the user passes the key in a scrapy.Request(meta={'deltafetch_key': unique_url_key}). This lets you avoid duplicate requests you have already visited in the past.

A sample MongoDB implementation using deltafetch.py:

# Fragment of a process_spider_output-style method: `r` is each request or
# item yielded by the spider, `response` is the response it came from.
if isinstance(r, Request):
    key = self._get_key(r)
    key = key + spider.name

    if self.db['your_collection_to_store_deltafetch_key'].find_one({"_id": key}):
        spider.log("Ignoring already visited: %s" % r, level=log.INFO)
        continue
elif isinstance(r, BaseItem):
    key = self._get_key(response.request)
    key = key + spider.name
    try:
        self.db['your_collection_to_store_deltafetch_key'].insert({"_id": key, "time": datetime.now()})
    except:
        spider.log("Ignoring already visited: %s" % key, level=log.ERROR)
yield r

e.g. for id = 345: `scrapy.Request(url, meta={'deltafetch_key': 345}, callback=parse)`
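
Enabling the middleware is just a matter of wiring it into the spider middlewares. A sketch assuming the scrapylib module linked above (the module path and setting name are assumptions, check them against the deltafetch.py version you actually use):

# settings.py -- sketch; module path and setting name assume scrapylib's DeltaFetch
SPIDER_MIDDLEWARES = {
    'scrapylib.deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True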

– Manoj Sahu
1

Here is my custom filter, based on Scrapy 0.24.6.

This filter only cares about the numeric id in the URL. For example,

http://www.example.com/products/cat1/1000.html?p=1
http://www.example.com/products/cat2/1000.html?p=2

are treated as the same URL. But

http://www.example.com/products/cat2/all.html

will not be.

import re

from scrapy.dupefilter import RFPDupeFilter


class MyCustomURLFilter(RFPDupeFilter):

    def _get_id(self, url):
        m = re.search(r'(\d+)\.html', url)
        return None if m is None else m.group(1)

    def request_fingerprint(self, request):
        style_id = self._get_id(request.url)
        if style_id is None:
            # No id in the URL: fall back to the default fingerprint so
            # unrelated URLs are not collapsed into one another.
            return super(MyCustomURLFilter, self).request_fingerprint(request)
        return style_id
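
As with the other answers, the filter still has to be registered in settings.py; the module path below is just an example:

DUPEFILTER_CLASS = 'myproject.custom_filters.MyCustomURLFilter'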
– chengbo
0

In the latest Scrapy, we can use the built-in duplicate filter, or extend it and have a custom one.

Define the config below in the spider settings:

DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'
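
To extend it instead, a custom filter on a recent release subclasses from the renamed scrapy.dupefilters module. A minimal sketch that ignores the refer parameter from the question (the names are examples, and very recent Scrapy versions move fingerprinting into a separate component, so check the docs for your version):

# myproject/dupefilters.py -- example module name
from scrapy.dupefilters import RFPDupeFilter


class IgnoreReferDupeFilter(RFPDupeFilter):
    """Treats URLs that differ only in the 'refer' query parameter as duplicates."""

    def request_fingerprint(self, request):
        # Drop everything from "&refer" onward before fingerprinting,
        # mirroring the accepted answer, but on Scrapy >= 1.0.
        return request.url.split("&refer")[0]

Then point DUPEFILTER_CLASS at 'myproject.dupefilters.IgnoreReferDupeFilter'.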

– Nagendran