
I am using Scrapy 0.20.2 with Python 2.7.

I want to know whether the links I am scraping now have already been scraped before.

After a lot of searching, I found this example: how to filter duplicate requests based on url in scrapy

I copied the code, put it in my spiders folder, and changed the settings, but I got this exception:

Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1237, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1099, in _inlineCallbacks
    result = g.send(result)
  File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\crawler.py", line 66, in start
    yield self.engine.open_spider(self._spider, self._start_requests())
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1237, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1099, in _inlineCallbacks
    result = g.send(result)
  File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\core\engine.py", line 221, in open_spider
    scheduler = self.scheduler_cls.from_crawler(self.crawler)
  File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\core\scheduler.py", line 25, in from_crawler
    dupefilter = dupefilter_cls.from_settings(settings)
exceptions.AttributeError: 'module' object has no attribute 'from_settings'

My code:

import os

from scrapy.dupefilter import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

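class CustomFilter(RFPDupeFilter):
    """Dupe filter that treats URLs as equal when they differ only after "&refer"."""

    def __getid(self, url):
        # Use the part of the URL before "&refer" as the identity (or something like that)
        mm = url.split("&refer")[0]
        return mm

    def request_seen(self, request):
        print "SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS"
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True
        # First time this id is seen: remember it and optionally persist it
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        return False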

And in the settings I added this:

DUPEFILTER_CLASS = 'myproject.spiders.CustomFilter'
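
One possible reading of the error, sketched under assumptions: if the class lives in a file named CustomFilter.py inside the spiders folder (the file name is a guess, it is not stated above), then the dotted path 'myproject.spiders.CustomFilter' may resolve to that module rather than to the class inside it, which would match the "'module' object has no attribute 'from_settings'" message. The load_object helper below is a simplified stand-in written for this illustration, not Scrapy's actual loader, and the in-memory package only mimics the assumed layout.

```python
import sys
import types
from importlib import import_module


def load_object(path):
    """Hypothetical loader: import 'pkg.module' and read 'Name' from it."""
    module_path, _, name = path.rpartition(".")
    module = import_module(module_path)
    return getattr(module, name)


# Throwaway in-memory package mirroring the ASSUMED layout:
#   myproject/spiders/CustomFilter.py  containing  class CustomFilter
pkg = types.ModuleType("myproject")
pkg.__path__ = []
spiders = types.ModuleType("myproject.spiders")
spiders.__path__ = []
filter_module = types.ModuleType("myproject.spiders.CustomFilter")


class CustomFilter(object):
    """Stand-in for the real filter; only from_settings matters here."""

    @classmethod
    def from_settings(cls, settings):
        return cls()


# Wire the fake modules together the way a real import would.
filter_module.CustomFilter = CustomFilter
spiders.CustomFilter = filter_module
pkg.spiders = spiders
sys.modules["myproject"] = pkg
sys.modules["myproject.spiders"] = spiders
sys.modules["myproject.spiders.CustomFilter"] = filter_module

# The path from the settings stops at the module ...
short = load_object("myproject.spiders.CustomFilter")
# ... while one more segment reaches the class defined inside it.
full = load_object("myproject.spiders.CustomFilter.CustomFilter")

print(type(short).__name__, hasattr(short, "from_settings"))
print(full.__name__, hasattr(full, "from_settings"))
```

If that assumption holds, the path in the settings would need one more segment naming the class, for example 'myproject.spiders.CustomFilter.CustomFilter'. With a different file name, the middle segment would change accordingly.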
Marco Dinatsoli
