
I am new to Python and Scrapy, and I am building a simple Scrapy project that scrapes posts from a forum. However, sometimes when crawling a post I get a 200 response but am redirected to an empty page (perhaps because the forum's server is unstable, or for some other reason). I would like to retry all of those failed scrapes.

Since this is long to read in full, let me summarize my questions up front:

1) Can I execute the retry using a custom RetryMiddleware for only one specific method?

2) Can I do something after the first scraping pass has finished?

Okay, let's start.

The overall logic of my code is as follows:

  1. Crawl the homepage of the forum

  2. Crawl into every post from the homepage

  3. Scrape the data from the post

    def start_requests(self):
        yield scrapy.Request('https://www.forumurl.com', self.parse_page)

    def parse_page(self, response):  # go into every thread
        hrefs = response.xpath('blahblah')
        for href in hrefs:
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_post)

    def parse_post(self, response):  # really scraping the content
        # check whether the content is empty
        content_empty = len(response.xpath('//table[@class="content"]'))
        if content_empty == 0:
            # do something
            pass

        item = ForumItem()
        item['some_content'] = response.xpath('//someXpathCode')

        yield item

I have read a lot on Stack Overflow and thought I could do it in one of two ways (and have done some coding for both):

1) Create a custom RetryMiddleware

2) Do the retry just inside the spider

However, I have had no luck with either of them. The reasons for the failures are as below:

For the custom RetryMiddleware, I followed this, but it checks every page I crawl, including robots.txt, so it keeps retrying everything. What I want is to perform the retry check only inside parse_post. Is this possible?
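
A rough sketch of the kind of middleware I mean (the `CustomRetryMiddleware` class name and the `retry_empty` meta key are just placeholders I made up, and the middleware would still need to be enabled under `DOWNLOADER_MIDDLEWARES` in settings.py):

    from scrapy.downloadermiddlewares.retry import RetryMiddleware

    class CustomRetryMiddleware(RetryMiddleware):
        def process_response(self, request, response, spider):
            # Only look at requests that the spider explicitly flagged,
            # so the homepage, robots.txt, etc. pass through untouched.
            if not request.meta.get('retry_empty'):
                return response
            # A 200 with no content table counts as a failure -> retry it,
            # reusing the _retry helper the stock RetryMiddleware provides.
            if not response.xpath('//table[@class="content"]'):
                return self._retry(request, 'empty post', spider) or response
            return response

The flag would then be set only on the post requests, e.g. `yield scrapy.Request(url, callback=self.parse_post, meta={'retry_empty': True})` in parse_page.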

For the retry inside the spider, I have tried two approaches.

First, I added a class variable _post_not_crawled = [] and append response.url to it whenever the empty check is true. I then adjusted start_requests to retry all the failed scrapes after the first pass has finished:

    def start_requests(self):
        yield scrapy.Request('https://www.forumurl.com', self.parse_page)
        while self._post_not_crawled:
            yield scrapy.Request(self._post_not_crawled.pop(0), callback=self.parse_post)

But of course this doesn't work, because the while loop runs before any data has actually been scraped, so it executes only once, with an empty _post_not_crawled list, before the scraping even starts. Is it possible to do something after the first scraping pass has finished?
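
For concreteness, a rough sketch of the kind of hook I am looking for, based on Scrapy's spider_idle signal, which fires when the scheduler runs out of requests (the `on_idle` method name is mine, and I believe the `engine.crawl` call signature differs between Scrapy versions):

    import scrapy
    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider

    class ForumSpider(scrapy.Spider):
        name = 'forum'
        _post_not_crawled = []

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super(ForumSpider, cls).from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
            return spider

        def on_idle(self, spider):
            # Called whenever the scheduler is empty, i.e. the first pass is done.
            if self._post_not_crawled:
                while self._post_not_crawled:
                    url = self._post_not_crawled.pop(0)
                    self.crawler.engine.crawl(
                        scrapy.Request(url, callback=self.parse_post, dont_filter=True),
                        spider)
                # Keep the spider alive so the re-queued requests get crawled.
                raise DontCloseSpider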

My second attempt was to retry directly inside parse_post():

    if content_empty == 0:
        logging.warning('Post was empty: ' + response.url)
        retryrequest = scrapy.Request(response.url, callback=self.parse_post)
        retryrequest.dont_filter = True
        return retryrequest
    else:
        # do the scraping
        pass

Update: some logs from this method:

2017-09-03 05:15:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6778647> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:43 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6778647
2017-09-03 05:15:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6778568> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:44 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6778568
2017-09-03 05:15:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.hkgolden.com/view.aspx?type=BW&message=6774780> (referer: https://forum.hkgolden.com/topics.aspx?type=BW&page=2)
2017-09-03 05:15:46 [root] WARNING: Post was empty: https://forum.hkgolden.com/view.aspx?type=BW&message=6774780

But this doesn't work either; the retry request is simply skipped without any sign of it being scheduled.
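
For reference, the shape the comments below suggest would be roughly this, with yield instead of return and dont_filter passed to the constructor; the `empty_retries` meta counter is something I made up purely so a permanently empty post cannot retry forever:

    def parse_post(self, response):
        if not response.xpath('//table[@class="content"]'):
            retries = response.meta.get('empty_retries', 0)
            if retries < 3:  # arbitrary cap, purely for illustration
                logging.warning('Post was empty, retrying: %s', response.url)
                yield scrapy.Request(response.url,
                                     callback=self.parse_post,
                                     dont_filter=True,
                                     meta={'empty_retries': retries + 1})
            return

        item = ForumItem()
        item['some_content'] = response.xpath('//someXpathCode')
        yield item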

Thanks for reading all of this. I appreciate all of your help.

Joe Leung
  • Change `return retryrequest` to `yield retryrequest`. And you can also use `retryrequest = scrapy.Request(response.url, callback=self.parse_post, dont_filter=True)` in one line. – Tarun Lalwani Sep 01 '17 at 10:43
  • Thanks for answering. I have tried yield retryrequest, but the program still didn't do anything :( – Joe Leung Sep 01 '17 at 11:32
  • Are you sure you are hitting this condition? Because this should work? You need to create a pastebin with the logs for us to check what is wrong – Tarun Lalwani Sep 01 '17 at 13:02
  • Thanks again. Actually I have added some logging statements, so I am sure it hits the condition. I have updated the logs in my question, and we can see that, after the log, it just skips the retry and goes directly to the next post. – Joe Leung Sep 03 '17 at 05:20
  • Your request will not be retried immediately as it will be queued. So you might see the log for retried data a lot later – Tarun Lalwani Sep 03 '17 at 05:38
