I am using Scrapy 0.20.2 with Python 2.7.
I want to know whether the links I am currently scraping have already been scraped before.
I searched a lot and found this example: "how to filter duplicate requests based on url in scrapy".
I copied the code, put it in my spiders folder, and changed the setting, but I got this exception:
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1237, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1099, in _inlineCallbacks
    result = g.send(result)
  File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\crawler.py", line 66, in start
    yield self.engine.open_spider(self._spider, self._start_requests())
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1237, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1099, in _inlineCallbacks
    result = g.send(result)
  File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\core\engine.py", line 221, in open_spider
    scheduler = self.scheduler_cls.from_crawler(self.crawler)
  File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\core\scheduler.py", line 25, in from_crawler
    dupefilter = dupefilter_cls.from_settings(settings)
exceptions.AttributeError: 'module' object has no attribute 'from_settings'
My code:
import os
from scrapy.dupefilter import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class CustomFilter(RFPDupeFilter):
    def __getid(self, url):
        mm = url.split("&refer")[0]  # or something like that
        return mm

    def request_seen(self, request):
        print "SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS"
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
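To make the intent clearer, here is a minimal standalone sketch of the behaviour I expect from the filter, without Scrapy: two URLs that differ only after "&refer" should count as the same request. The example URLs and the names url_id and SeenUrls are made up for illustration only.

```python
def url_id(url):
    """Return the part of the URL before "&refer" (the whole URL if absent)."""
    return url.split("&refer")[0]


class SeenUrls(object):
    """Hypothetical stand-in for the dupefilter's set of fingerprints."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url):
        # True if an equivalent URL was already recorded; otherwise record it.
        fp = url_id(url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


seen = SeenUrls()
# Same item, different "refer" value: the second call should be a duplicate.
first = seen.request_seen("http://example.com/item?id=1&refer=home")
second = seen.request_seen("http://example.com/item?id=1&refer=search")
# Different item: should not be treated as a duplicate.
third = seen.request_seen("http://example.com/item?id=2&refer=home")
```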
And in the settings I added this:
DUPEFILTER_CLASS = 'myproject.spiders.CustomFilter'
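As far as I understand it, Scrapy resolves this setting by splitting the dotted path at the last dot, importing the left part as a module and taking the right part as an attribute of it. The sketch below is only my approximation of that kind of lookup using the standard library. It is not Scrapy's code, and load_by_path is an illustrative helper name. It shows that a path ending in a module name gives back a module, which would have no from_settings attribute.

```python
from importlib import import_module


def load_by_path(path):
    """Import the module before the last dot and return the attribute after it."""
    module_path, name = path.rsplit(".", 1)
    module = import_module(module_path)
    return getattr(module, name)


# "path" is an attribute of "os" that is itself a module, so a module comes back.
as_module = load_by_path("os.path")
# "join" is a function inside the "os.path" module, so a callable comes back.
as_function = load_by_path("os.path.join")

# A plain module has no "from_settings" attribute.
module_has_attr = hasattr(as_module, "from_settings")
```

If my spiders folder contains a file named CustomFilter.py, the path in my setting would presumably stop at that module rather than at the class inside it, which would match the error message.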