
I am learning Scrapy to crawl websites. I want to track the links/urls the spider could not fetch, and to trigger final tasks after the spider has completed its job.

The sample code below shows what I am looking for. Of course it is not a real-life case, but I am learning, so I want to achieve this.

In other words: is it possible to create something like function_to_be_triggered_when_url_is_not_able_to_fetch, a function that can track the urls the spider could not fetch? And how can I create a function similar to function_to_be_triggered_when_spider_has_done_its_all_pending_jobs(), which could be used for writing intermediate data to files or a database, or for sending mail, once the spider has crawled all the domains?

Here is the simple spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = 'spider1'
    allowed_domains = [line.strip() for line in open('url_list.txt')]
    start_urls = ['http://' + line.strip() for line in open('url_list.txt')]
    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.count_fetched_urls = 0    # count of urls fetched successfully from the domains/sites
        self.count_failed_to_fetch = 0  # count of urls that could not be fetched because of timeouts or 4XX HTTP errors

    def parse_item(self, response):
        self.count_fetched_urls = self.count_fetched_urls + 1
        # some more useful lines to process fetched urls

    def function_to_be_triggered_when_url_is_not_able_to_fetch(self):
        self.count_failed_to_fetch = self.count_failed_to_fetch + 1
        print self.count_failed_to_fetch, 'urls failed to fetch so far'

    def function_to_be_triggered_when_spider_has_done_its_all_pending_jobs(self):
        print 'Total domains/sites:', len(self.start_urls)
        print 'Total links/urls the spider faced:', self.count_fetched_urls + self.count_failed_to_fetch
        print 'Successfully fetched urls/links:', self.count_fetched_urls
        print 'Failed to fetch urls/links:', self.count_failed_to_fetch
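
For context, from reading the Scrapy docs it looks like the errback argument of Request and the closed() method (a shortcut for the spider_closed signal) might be the hooks I need. Below is a rough sketch of what I am aiming for, using a plain Spider instead of CrawlSpider (the name on_fetch_error is just a placeholder of mine, and I have not verified this end to end):

import scrapy

class TrackingSpider(scrapy.Spider):
    name = 'tracking_spider'

    def __init__(self, *args, **kwargs):
        super(TrackingSpider, self).__init__(*args, **kwargs)
        self.count_fetched_urls = 0
        self.count_failed_to_fetch = 0

    def start_requests(self):
        # url_list.txt is assumed to contain one domain per line
        with open('url_list.txt') as f:
            domains = [line.strip() for line in f if line.strip()]
        for domain in domains:
            # errback is called for timeouts, DNS failures and HTTP error responses
            yield scrapy.Request('http://' + domain,
                                 callback=self.parse_item,
                                 errback=self.on_fetch_error)

    def parse_item(self, response):
        self.count_fetched_urls += 1

    def on_fetch_error(self, failure):
        # failure.request.url is the url that could not be fetched
        self.count_failed_to_fetch += 1
        self.log('%d urls failed to fetch so far' % self.count_failed_to_fetch)

    def closed(self, reason):
        # shortcut for the spider_closed signal: runs after all pending requests are done
        self.log('Successfully fetched urls: %d' % self.count_fetched_urls)
        self.log('Failed to fetch urls: %d' % self.count_failed_to_fetch)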
  • possible duplicate of [How to get the scrapy failure URLs?](http://stackoverflow.com/questions/13724730/how-to-get-the-scrapy-failure-urls) – Talvalin Mar 07 '14 at 08:36
  • I've answered a question about how to get the failed URLs in scrapy previously (see "possible duplicate comment above"), so please check that and then perhaps edit your question to discuss the intermediate data write question. – Talvalin Mar 07 '14 at 08:38
  • Did that other answer help resolve your issue? :) – Talvalin Apr 01 '14 at 08:08
  • yes, and I have already voted up your answer :) – Alok Apr 01 '14 at 10:28

0 Answers