I'm building a Scrapy crawler, but the default dupe filter class RFPDupeFilter doesn't work properly in my application: the crawler returns a lot of duplicate content.
So I tried the following example of how to filter duplicate requests based on URL in Scrapy.
But it didn't work for me. It gives me the error ImportError: No module named scraper.custom_filters, even though I saved the class in custom_filters.py in the same directory as settings.py.
from scrapy.dupefilter import RFPDupeFilter

class SeenURLFilter(RFPDupeFilter):
    """A dupe filter that considers the full URL, not the request fingerprint."""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        if request.url in self.urls_seen:
            return True  # duplicate: Scrapy drops this request
        self.urls_seen.add(request.url)
        return False  # first time this URL is seen
Add the DUPEFILTER_CLASS constant to settings.py:
DUPEFILTER_CLASS = 'scraper.custom_filters.SeenURLFilter'
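The seen-URL logic itself can be checked without Scrapy at all. Below is a minimal, Scrapy-free sketch of the same check the filter performs (the class name `SeenURLSet` and the example URLs are illustrative, not part of the Scrapy API):

```python
# Scrapy-free sketch of the seen-URL check used by SeenURLFilter above:
# a set of URL strings, where True means "duplicate, drop this request".
class SeenURLSet:
    def __init__(self):
        self.urls_seen = set()

    def request_seen(self, url):
        if url in self.urls_seen:
            return True  # duplicate: the crawler should skip it
        self.urls_seen.add(url)
        return False  # first time this URL is seen


f = SeenURLSet()
print(f.request_seen("http://example.com/a"))  # False (new URL)
print(f.request_seen("http://example.com/a"))  # True  (duplicate)
```

If this logic behaves as expected in isolation, the remaining problem is purely the import path in DUPEFILTER_CLASS, i.e. whether Python can resolve `scraper.custom_filters` from the project root.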