I am learning Scrapy to crawl websites. I want to track the links/URLs that the spider could not fetch, and to trigger final tasks after the spider has completed its job.
The sample code below shows what I am looking for. Of course it is not a real-life case, but I am learning, so I want to achieve this.
In other words: is it possible to create something like function_to_be_triggered_when_url_is_not_able_to_fetch, i.e. a function that can track the URLs the spider could not fetch? And how can I create a function similar to function_to_be_triggered_when_spider_has_done_its_all_pending_jobs(), which could be used to write intermediate data to files or a database, or to send mail, once the spider has crawled all the domains?
Here is the simple spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = 'spider1'
    allowed_domains = [i.split('\n')[0] for i in open('url_list.txt', 'r').readlines()]
    start_urls = ['http://' + i.split('\n')[0] for i in open('url_list.txt', 'r').readlines()]
    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.count_fetched_urls = 0    # to count successfully fetched urls from the domains/sites
        self.count_failed_to_fetch = 0 # to count urls which could not be fetched because of timeouts or 4XX HTTP errors

    def parse_item(self, response):
        self.count_fetched_urls = self.count_fetched_urls + 1
        # some more useful lines to process fetched urls

    def function_to_be_triggered_when_url_is_not_able_to_fetch(self):
        self.count_failed_to_fetch = self.count_failed_to_fetch + 1
        print self.count_failed_to_fetch, 'urls failed to fetch till now'

    def function_to_be_triggered_when_spider_has_done_its_all_pending_jobs(self):
        print 'Total domains/sites:', len(self.start_urls)
        print 'Total links/urls the spider faced:', self.count_fetched_urls + self.count_failed_to_fetch
        print 'Successfully fetched urls/links:', self.count_fetched_urls
        print 'Failed to fetch urls/links:', self.count_failed_to_fetch
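From what I can tell from the Scrapy docs, something along these lines might do what I want, but I have not verified it: a Request can take an errback, which is called on timeouts, DNS failures and HTTP error responses, and a spider's closed(reason) method is supposed to be called when the crawl finishes (a shortcut for the spider_closed signal). Below is only a rough sketch of the idea (the class and spider name are just placeholders, and the errback here only covers the start requests, not the links followed by the rules):

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpiderSketch(CrawlSpider):
    name = 'spider1_sketch'
    allowed_domains = [line.strip() for line in open('url_list.txt')]
    start_urls = ['http://' + line.strip() for line in open('url_list.txt')]
    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]

    def __init__(self, category=None, *args, **kwargs):
        super(MySpiderSketch, self).__init__(*args, **kwargs)
        self.count_fetched_urls = 0
        self.count_failed_to_fetch = 0

    def start_requests(self):
        # attach an errback to every start request so failures are counted;
        # callback=self.parse keeps CrawlSpider's rule handling for these pages
        for url in self.start_urls:
            yield Request(url, callback=self.parse, errback=self.on_fetch_error, dont_filter=True)

    def parse_item(self, response):
        self.count_fetched_urls = self.count_fetched_urls + 1
        # process the fetched page here

    def on_fetch_error(self, failure):
        # called for timeouts, DNS errors and 4XX/5XX responses of the start requests
        self.count_failed_to_fetch = self.count_failed_to_fetch + 1
        print self.count_failed_to_fetch, 'urls failed to fetch till now'

    def closed(self, reason):
        # Scrapy calls this when the spider finishes (shortcut for the spider_closed signal)
        print 'Total domains/sites:', len(self.start_urls)
        print 'Total links/urls the spider faced:', self.count_fetched_urls + self.count_failed_to_fetch
        print 'Successfully fetched urls/links:', self.count_fetched_urls
        print 'Failed to fetch urls/links:', self.count_failed_to_fetch

If there is a better or more standard way to do this (for example, also tracking failures of the requests generated by the rules), I would like to know that as well.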