
I'm creating a Scrapy crawler, but the default dupe filter class RFPDupeFilter doesn't work properly in my application: the crawler gives me a lot of duplicate content.

So I tried the following example of how to filter duplicate requests based on URL in Scrapy.

But it didn't work for me. It gives me an error, ImportError: No module named scraper.custom_filters, even though I saved the class in custom_filters.py in the same directory as settings.py.

from scrapy.dupefilter import RFPDupeFilter  # renamed to scrapy.dupefilters in Scrapy 1.0+

class SeenURLFilter(RFPDupeFilter):
    """A dupe filter that considers only the URL."""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        # Report a request as a duplicate if its exact URL was seen before.
        if request.url in self.urls_seen:
            return True
        self.urls_seen.add(request.url)
        return False

Add the DUPEFILTER_CLASS constant to settings.py:

DUPEFILTER_CLASS = 'scraper.custom_filters.SeenURLFilter'
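For reference, the filter's dedupe logic can be exercised in isolation. This is a sketch that inlines the URL-set logic without the RFPDupeFilter base class, and uses a minimal FakeRequest stand-in for scrapy.Request, so it runs without Scrapy installed:

```python
class FakeRequest:
    """Minimal stand-in for scrapy.Request, carrying only a url attribute."""
    def __init__(self, url):
        self.url = url

class SeenURLFilter:
    """URL-only dedupe logic, without the RFPDupeFilter base class."""
    def __init__(self):
        self.urls_seen = set()

    def request_seen(self, request):
        # True means "duplicate, drop it"; False means "new, let it through".
        if request.url in self.urls_seen:
            return True
        self.urls_seen.add(request.url)
        return False

f = SeenURLFilter()
first = f.request_seen(FakeRequest('http://example.com/a'))   # False: new URL
second = f.request_seen(FakeRequest('http://example.com/a'))  # True: duplicate
```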

1 Answer

The path specified in DUPEFILTER_CLASS = 'scraper.custom_filters.SeenURLFilter' doesn't match your project layout, resulting in the ImportError. Most likely the dotted path is missing a package, or includes one it shouldn't.

For your project, find your "scrapy.cfg" file and trace the directory structure from that point to determine the dotted path to use in your setting. For yours to be correct, your directory structure would need to be similar to:

myproject
   |---<scraper>
   |   |---<spiders>
   |   |   |---__init__.py
   |   |   |---myspider.py
   |   |---__init__.py
   |   |---<...>
   |   |---custom_filters.py
   |   |---settings.py
   |---scrapy.cfg
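Scrapy resolves that setting string by importing the module part and looking up the final attribute on it. A simplified sketch of that resolution (the load_object function here is a hand-rolled illustration, roughly what Scrapy's own helper does; it's demonstrated with a stdlib class since 'scraper' only exists inside your project):

```python
import importlib

def load_object(path):
    # Split 'scraper.custom_filters.SeenURLFilter' into the module path
    # ('scraper.custom_filters') and the attribute name ('SeenURLFilter'),
    # then import the module and fetch the class from it.
    module_path, _, name = path.rpartition('.')
    module = importlib.import_module(module_path)
    return getattr(module, name)

cls = load_object('collections.OrderedDict')  # stands in for your filter class
```

If any segment of the module path isn't importable (a wrong directory name, or a directory missing its __init__.py), importlib raises exactly the "No module named ..." ImportError you're seeing.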