
I have a problem like this:

how to filter duplicate requests based on url in scrapy

So I do not want a page to be crawled more than once. I adapted the dupefilter middleware and added a print statement to test whether it correctly classifies already-seen pages. It does.

Nonetheless, the parsing seems to be executed multiple times, because the JSON file I receive contains duplicate entries.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item

from crawlspider.items import KickstarterItem

from HTMLParser import HTMLParser

### code for stripping off HTML tags:
class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return str(''.join(self.fed))

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
###

items = []

class MySpider(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ['http://www.kickstarter.com']

    rules = (
        # Follow links to the comics category pages
        # (no callback given, so follow defaults to True).
        Rule(SgmlLinkExtractor(allow=('discover/categories/comics', ))),

        # Extract project links and parse them with the spider's parse_item method.
        Rule(SgmlLinkExtractor(allow=('projects/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        item = KickstarterItem()

        item['date'] = hxs.select('//*[@id="about"]/div[2]/ul/li[1]/text()').extract()
        item['projname'] = hxs.select('//*[@id="title"]/a').extract()
        item['projname'] = strip_tags(str(item['projname']))

        item['projauthor'] = hxs.select('//*[@id="name"]')
        item['projauthor'] = item['projauthor'].select('string()').extract()[0]

        item['backers'] = hxs.select('//*[@id="backers_count"]/data').extract()
        item['backers'] = strip_tags(str(item['backers']))

        item['collmoney'] = hxs.select('//*[@id="pledged"]/data').extract()
        item['collmoney'] = strip_tags(str(item['collmoney']))

        item['goalmoney'] = hxs.select('//*[@id="stats"]/h5[2]/text()').extract()
        items.append(item)
        return items
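
A side note on parse_item itself: appending to the module-level items list and returning that growing list means every call returns all previously collected items again, which on its own duplicates entries in the exported JSON. A minimal sketch (same fields and XPaths as above, rest of the spider unchanged) that emits one item per project page instead:

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        item = KickstarterItem()
        # ... fill the item fields exactly as above ...
        return item  # return only the new item; the exporter collects the results itself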

My items.py looks like this:

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class KickstarterItem(Item):
    # define the fields for your item here like:
    date = Field()
    projname = Field()
    projauthor = Field()
    backers = Field()
    collmoney = Field()
    goalmoney = Field()
    pass

My middleware looks like this:

import os

from scrapy.dupefilter import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class CustomFilter(RFPDupeFilter):
    def __getid(self, url):
        mm = url.split("/")[4]  # extracts the project id (a number) from the project URL
        print "_____________", mm
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        self.fingerprints.add(fp)
        if fp in self.fingerprints and fp.isdigit():  # .isdigit() checks whether fp comes from a project id
            print "______fp is a number (therefore a project-id) and has been encountered before______"
            return True
        if self.file:
            self.file.write(fp + os.linesep)

I added this line to settings.py:

DUPEFILTER_CLASS = 'crawlspider.duplicate_filter.CustomFilter'
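
Note that, as written, request_seen adds fp to self.fingerprints before testing membership, so the membership test also succeeds the first time a project URL is seen. A minimal sketch with the check done before the add could look like this; the length guard and the fallback to the default fingerprint check for non-project URLs are additions for illustration, not part of the original code:

import os

from scrapy.dupefilter import RFPDupeFilter

class CustomFilter(RFPDupeFilter):
    def __getid(self, url):
        parts = url.split("/")
        # the project id is assumed to be the fifth path segment of a project URL
        return parts[4] if len(parts) > 4 else ""

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if not fp.isdigit():
            # not a project URL: fall back to Scrapy's default fingerprint check
            return super(CustomFilter, self).request_seen(request)
        if fp in self.fingerprints:
            # this project id has already been requested
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        return False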

I start the crawl with "scrapy crawl kickstarter -o items.json -t json" and I do see the expected print statements from the middleware code. Any idea why the JSON still contains multiple entries with the same data?

  • Could you post all of your spider code please? It'll make it easier to track duplicates. :) – Talvalin Feb 02 '13 at 21:27
  • @Talvalin I added some code to my original post. Thanks for your help. – Damian Feb 03 '13 at 08:14
  • Any concrete reason why you are not using the default Scrapy mechanism for this? There is already a [dupefilter](https://scrapy.readthedocs.org/en/latest/topics/settings.html?highlight=settings#dupefilter-class) implemented and active, AFAIK. – DrColossos Feb 04 '13 at 09:09
  • @DrColossos: Would this standard dupefilter be activated if I just removed the line "DUPEFILTER_CLASS = 'crawlspider.duplicate_filter.CustomFilter'" from settings.py? When I try this, I still get those multiple entries in the JSON file. – Damian Feb 04 '13 at 11:48
  • I remember the reason why I did not use the standard dupefilter: there is a set of overview pages, each containing links to ten project pages that I am interested in. So the policy should be that I just do not want to revisit project pages. So maybe the point is to change the rules section within the spider? – Damian Feb 04 '13 at 12:03
  • @DrColossos It's because the duplicate filter in the scheduler only filters out URLs already seen within a single spider run (meaning it gets reset on subsequent runs). – mrudult Dec 26 '13 at 12:15

1 Answer


So now these are the three modifications that removed the duplicates:

I added this line to settings.py:

ITEM_PIPELINES = ['crawlspider.pipelines.DuplicatesPipeline',]

to let Scrapy know about the DuplicatesPipeline class I added in pipelines.py:

from scrapy import signals
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['projname'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['projname'])
            return item
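
As a version note: newer Scrapy releases expect ITEM_PIPELINES to be a dict mapping the pipeline path to an order number (lower values run earlier), so with a recent Scrapy the settings.py entry would look like this instead; the value 300 is just an arbitrary middle-of-the-range choice:

ITEM_PIPELINES = {
    'crawlspider.pipelines.DuplicatesPipeline': 300,
}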

You do not need to adjust the spider, and you do not need the dupefilter/middleware code I posted above.

I have the feeling, though, that this solution does not reduce the network traffic, since the Item object has to be created (and the page fetched) before it is evaluated and possibly dropped. But I am okay with that.
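
If filtering at the request level is wanted after all, one option is to filter the extracted project links in the spider itself via the Rule's process_links hook, so duplicate project pages are never requested in the first place. A minimal sketch, assuming the project id is the fifth path segment of a project URL (as in the __getid helper above); the dedupe_links method and the seen_ids set are illustrative names, and the rest of the spider stays as posted:

class MySpider(CrawlSpider):
    # name, allowed_domains, start_urls and parse_item exactly as posted above

    rules = (
        Rule(SgmlLinkExtractor(allow=('discover/categories/comics', ))),
        Rule(SgmlLinkExtractor(allow=('projects/', )),
             callback='parse_item', process_links='dedupe_links'),
    )

    seen_ids = set()

    def dedupe_links(self, links):
        # keep only links whose project id has not been seen yet
        fresh = []
        for link in links:
            parts = link.url.split("/")
            pid = parts[4] if len(parts) > 4 else ""
            if not pid.isdigit():
                fresh.append(link)  # not a project link, keep it
            elif pid not in self.seen_ids:
                self.seen_ids.add(pid)
                fresh.append(link)
        return fresh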

(Solution found by asker, moved into an answer)

  • looks like you will still scrape all the duplicate pages; you only filter items before outputting the results – Temak Apr 11 '17 at 22:11
  • Although the question talks about duplicate requests being the problem, I believe that different requests were including duplicate items and that was the actual underlying problem, so the solution is correct; it is the question that needs some work. – Gallaecio Jan 14 '19 at 10:57