I am new to Python and Scrapy, and I am making a simple Scrapy project to scrape posts from a forum. However, sometimes when crawling a post I get a 200 but am redirected to an empty page (maybe because of the forum server's instability, or some other reason, but whatever). I would like to retry all of those failed scrapes.
As this is a long post, here is a summary of my questions:
1) Can I execute the retry using a custom RetryMiddleware for only one specific method?
2) Can I do something after the first round of scraping has finished?
Okay, let's start.
The overall logic of my code is as below:
Crawl the homepage of the forum
Crawl into every post from the homepage
Scrape the data from the post
def start_requests(self):
    yield scrapy.Request('https://www.forumurl.com', self.parse_page)

def parse_page(self, response):
    # Going into all the threads
    hrefs = response.xpath('blahblah')
    for href in hrefs:
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_post)

def parse_post(self, response):
    # really scraping the content
    content_empty = len(response.xpath('//table[@class="content"]'))  # check if the content is empty
    if content_empty == 0:
        # do something
        pass

    item = ForumItem()
    item['some_content'] = response.xpath('//someXpathCode')
    yield item
I have read a lot on Stack Overflow, and I thought I could do it in two ways (and have done some coding):
1) Create a custom RetryMiddleware
2) Do the retry just inside the spider
However, I have had no luck with either of them. The reasons for failure are as follows:
For the custom RetryMiddleware, I followed this, but it checks every page I crawl, including robots.txt, so it is always retrying. What I want is to do the retry check only inside parse_post. Is this possible?
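What I have in mind is something like the untested sketch below: a middleware that applies the empty-content check only to requests that opt in through request.meta (the check_empty key is just a name I made up):

from scrapy.downloadermiddlewares.retry import RetryMiddleware

class CustomRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        # Ignore requests that did not opt in (robots.txt, the topic pages, etc.)
        if not request.meta.get('check_empty'):
            return response
        # Retry when the post body table is missing from the page
        if not response.xpath('//table[@class="content"]'):
            return self._retry(request, 'empty content', spider) or response
        return response

The post requests in parse_page would then opt in with meta={'check_empty': True}, and the middleware would be enabled in DOWNLOADER_MIDDLEWARES in place of the built-in RetryMiddleware. Is this the right way to restrict it?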
For retrying inside the spider, I have tried two approaches.
First, I added a class variable _post_not_crawled = [] and append response.url to it when the empty check is true. I adjusted start_requests to retry all the failed scrapes after the first round of scraping finishes:
def start_requests(self):
    yield scrapy.Request('https://www.forumurl.com', self.parse_page)
    while self._post_not_crawled:
        yield scrapy.Request(self._post_not_crawled.pop(0), callback=self.parse_post)
But of course it doesn't work, because the while loop runs before any data has actually been scraped, so it executes only once, with an empty _post_not_crawled list, before the scraping starts. Is it possible to do something after the first round of scraping finishes?
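From what I have read, the spider_idle signal might be the way to do this. Below is a rough, untested sketch of what I mean (the names are mine, and the engine.crawl call may need adjusting depending on the Scrapy version):

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class ForumSpider(scrapy.Spider):
    name = 'forum'
    _post_not_crawled = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(ForumSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.retry_failed_posts, signal=signals.spider_idle)
        return spider

    def retry_failed_posts(self):
        # Fires when the scheduler has run dry, i.e. after the first round is done
        if self._post_not_crawled:
            while self._post_not_crawled:
                url = self._post_not_crawled.pop(0)
                # Note: newer Scrapy versions drop the spider argument from engine.crawl()
                self.crawler.engine.crawl(
                    scrapy.Request(url, callback=self.parse_post, dont_filter=True), self)
            # Keep the spider alive so the re-scheduled requests get processed
            raise DontCloseSpider

Would something along these lines be the right approach?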
My second attempt was to retry directly inside parse_post():
if content_empty == 0:
    logging.warning('Post was empty: ' + response.url)
    retryrequest = scrapy.Request(response.url, callback=self.parse_post)
    retryrequest.dont_filter = True
    return retryrequest
else:
    # do the scraping
Update: some logs from this method:
2017-09-03 05:15:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6778647> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:43 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6778647
2017-09-03 05:15:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6778568> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:44 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6778568
2017-09-03 05:15:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6774780> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:46 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6774780
But it doesn't work either; the retryrequest is just skipped without any sign.
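Could the reason be that parse_post is a generator (it yields items elsewhere), so return retryrequest never actually hands the request back to Scrapy? The variant I would try next (untested) passes dont_filter at construction time and yields the request instead:

if content_empty == 0:
    logging.warning('Post was empty: ' + response.url)
    # yield instead of return, with dont_filter set in the constructor
    yield scrapy.Request(response.url, callback=self.parse_post, dont_filter=True)
    return
# do the scraping as before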
Thanks for reading all of this. I appreciate all of your help.