
I have a problem like this:

how to filter duplicate requests based on url in scrapy

So I do not want a page to be crawled more than once. I adapted the dupefilter middleware and added a print statement to test whether it correctly classifies already-seen pages. It does.

Nonetheless, the parsing seems to be executed multiple times, because the JSON file I receive contains duplicate entries.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item

from crawlspider.items import KickstarterItem

from HTMLParser import HTMLParser

### code for stripping off HTML tags:
class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return str(''.join(self.fed))

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
###

items = []

class MySpider(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ['http://www.kickstarter.com']

    rules = (
        # Follow links to the comics category pages
        # (no callback given, so follow defaults to True).
        Rule(SgmlLinkExtractor(allow=('discover/categories/comics', ))),

        # Extract project links and parse them with the spider's parse_item method.
        Rule(SgmlLinkExtractor(allow=('projects/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        item = KickstarterItem()

        item['date'] = hxs.select('//*[@id="about"]/div[2]/ul/li[1]/text()').extract()
        item['projname'] = hxs.select('//*[@id="title"]/a').extract()
        item['projname'] = strip_tags(str(item['projname']))

        item['projauthor'] = hxs.select('//*[@id="name"]')
        item['projauthor'] = item['projauthor'].select('string()').extract()[0]

        item['backers'] = hxs.select('//*[@id="backers_count"]/data').extract()
        item['backers'] = strip_tags(str(item['backers']))

        item['collmoney'] = hxs.select('//*[@id="pledged"]/data').extract()
        item['collmoney'] = strip_tags(str(item['collmoney']))

        item['goalmoney'] = hxs.select('//*[@id="stats"]/h5[2]/text()').extract()
        items.append(item)
        return items
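
A side note on parse_item itself: appending to the module-level items list and returning that growing list means every call returns all previously collected items again, which on its own duplicates entries in the exported JSON. A minimal sketch (same fields and XPaths as above, rest of the spider unchanged) that emits one item per project page instead:

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        item = KickstarterItem()
        # ... fill the item fields exactly as above ...
        return item  # return only the new item; the exporter collects the results itself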

My items.py looks like this:

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class KickstarterItem(Item):
    # define the fields for your item here like:
    date = Field()
    projname = Field()
    projauthor = Field()
    backers = Field()
    collmoney = Field()
    goalmoney = Field()
    pass

My middleware looks like this:

import os

from scrapy.dupefilter import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class CustomFilter(RFPDupeFilter):
    def __getid(self, url):
        mm = url.split("/")[4]  # extracts the project id (a number) from the project URL
        print "_____________", mm
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        self.fingerprints.add(fp)
        if fp in self.fingerprints and fp.isdigit():  # .isdigit() checks whether fp comes from a project id
            print "______fp is a number (therefore a project-id) and has been encountered before______"
            return True
        if self.file:
            self.file.write(fp + os.linesep)

I added this line to settings.py:

DUPEFILTER_CLASS = 'crawlspider.duplicate_filter.CustomFilter'
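
Note that, as written, request_seen adds fp to self.fingerprints before testing membership, so the membership test also succeeds the first time a project URL is seen. A minimal sketch with the check done before the add could look like this; the length guard and the fallback to the default fingerprint check for non-project URLs are additions for illustration, not part of the original code:

import os

from scrapy.dupefilter import RFPDupeFilter

class CustomFilter(RFPDupeFilter):
    def __getid(self, url):
        parts = url.split("/")
        # the project id is assumed to be the fifth path segment of a project URL
        return parts[4] if len(parts) > 4 else ""

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if not fp.isdigit():
            # not a project URL: fall back to Scrapy's default fingerprint check
            return super(CustomFilter, self).request_seen(request)
        if fp in self.fingerprints:
            # this project id has already been requested
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        return False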

I start the crawl with "scrapy crawl kickstarter -o items.json -t json" and I do see the expected print statements from the middleware code. Any idea why the JSON still contains multiple entries with the same data?

  • Could you post all of your spider code please? It'll make it easier to track duplicates. :) – Talvalin Feb 02 '13 at 21:27
  • @Talvalin I added some code to my original post. Thanks for your help. – Damian Feb 03 '13 at 08:14
  • Any concrete reason why you are not using the default Scrapy mechanism for this? There is already a [dupefilter](https://scrapy.readthedocs.org/en/latest/topics/settings.html?highlight=settings#dupefilter-class) implemented and active, AFAIK. – DrColossos Feb 04 '13 at 09:09
  • @DrColossos: Would this standard dupefilter be activated if I just removed the line "DUPEFILTER_CLASS = 'crawlspider.duplicate_filter.CustomFilter'" from settings.py? When I try this, I still get those multiple entries in the JSON file. – Damian Feb 04 '13 at 11:48
  • I remember the reason why I did not use the standard dupefilter: there is a set of overview pages, each containing links to ten project pages that I am interested in. So the policy should be that I just do not want to revisit project pages. So maybe the point is to change the rules section within the spider? – Damian Feb 04 '13 at 12:03
  • @DrColossos It's because the duplicate filter in the scheduler only filters out URLs already seen within a single spider run (meaning it gets reset on subsequent runs). – mrudult Dec 26 '13 at 12:15

1 Answer


So now these are the three modifications that removed the duplicates:

I added this line to settings.py:

ITEM_PIPELINES = ['crawlspider.pipelines.DuplicatesPipeline',]

to let Scrapy know about the DuplicatesPipeline class I added in pipelines.py:

from scrapy import signals
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['projname'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['projname'])
            return item
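
As a version note: newer Scrapy releases expect ITEM_PIPELINES to be a dict mapping the pipeline path to an order number (lower values run earlier), so with a recent Scrapy the settings.py entry would look like this instead; the value 300 is just an arbitrary middle-of-the-range choice:

ITEM_PIPELINES = {
    'crawlspider.pipelines.DuplicatesPipeline': 300,
}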

You do not need to adjust the spider, and you do not need the dupefilter/middleware code I posted above.

I have the feeling, though, that this solution does not reduce the network traffic, since the Item object has to be created (and the page fetched) before it is evaluated and possibly dropped. But I am okay with that.
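
If filtering at the request level is wanted after all, one option is to filter the extracted project links in the spider itself via the Rule's process_links hook, so duplicate project pages are never requested in the first place. A minimal sketch, assuming the project id is the fifth path segment of a project URL (as in the __getid helper above); the dedupe_links method and the seen_ids set are illustrative names, and the rest of the spider stays as posted:

class MySpider(CrawlSpider):
    # name, allowed_domains, start_urls and parse_item exactly as posted above

    rules = (
        Rule(SgmlLinkExtractor(allow=('discover/categories/comics', ))),
        Rule(SgmlLinkExtractor(allow=('projects/', )),
             callback='parse_item', process_links='dedupe_links'),
    )

    seen_ids = set()

    def dedupe_links(self, links):
        # keep only links whose project id has not been seen yet
        fresh = []
        for link in links:
            parts = link.url.split("/")
            pid = parts[4] if len(parts) > 4 else ""
            if not pid.isdigit():
                fresh.append(link)  # not a project link, keep it
            elif pid not in self.seen_ids:
                self.seen_ids.add(pid)
                fresh.append(link)
        return fresh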

(Solution found by asker, moved into an answer)

  • looks like you will still scrape all the duplicate pages; you only filter items before outputting the results – Temak Apr 11 '17 at 22:11
  • Although the question talks about duplicate requests being the problem, I believe that different requests were including duplicate items and that was the actual underlying problem, so the solution is correct; it is the question that needs some work. – Gallaecio Jan 14 '19 at 10:57