
For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some JavaScript that auto-reloads until it gets the real page. I can detect when this happens, and I want to retry downloading and scraping the page. The logic I use in my CrawlSpider is something like:

def parse_page(self, response):
    url = response.url

    # Check to make sure the page is loaded
    if 'var PageIsLoaded = false;' in response.body:
        self.logger.warning('parse_page encountered an incomplete rendering of {}'.format(url))
        yield Request(url, self.parse, dont_filter=True)
        return

    ...
    # Normal parsing logic

However, it seems that when the retry logic gets called and a new Request is issued, the pages and the links they contain don't get crawled or scraped. My thought was that by using self.parse, which the CrawlSpider uses to apply the crawl rules, together with dont_filter=True, I could avoid the duplicate filter. However, with DUPEFILTER_DEBUG = True, I can see that the retry requests do get filtered away.

Am I missing something, or is there a better way to handle this? I'd like to avoid the complication of dynamic JS rendering with something like Splash if possible, since this only happens intermittently.

JoshAdel

1 Answer


I would think about having a custom retry downloader middleware instead, similar to the built-in RetryMiddleware.

Sample implementation (not tested):

import logging

logger = logging.getLogger(__name__)


class RetryMiddleware(object):
    def process_response(self, request, response, spider):
        # The placeholder page carries this flag until the real content is ready
        if 'var PageIsLoaded = false;' in response.body:
            logger.warning('Encountered an incomplete rendering of {}'.format(response.url))
            return self._retry(request) or response

        return response

    def _retry(self, request):
        logger.debug("Retrying %(request)s", {'request': request})

        # Re-issue a copy of the request and bypass the dupefilter,
        # since this URL's fingerprint has already been seen
        retryreq = request.copy()
        retryreq.dont_filter = True
        return retryreq

And don't forget to activate it via the DOWNLOADER_MIDDLEWARES setting.
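For example, assuming the class lives in a hypothetical myproject/middlewares.py module (adjust the module path and the priority number to your project), the settings.py entry could look like:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RetryMiddleware': 543,
}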

alecxe
  • The middleware worked beautifully. I was able to pretty much re-use the builtin retry middleware and just strip out the exception codes stuff and replace it with my own test. Thanks again for your help. – JoshAdel Sep 20 '15 at 01:23
  • @JoshAdel yeah, scrapy really makes everything quite modular and clean - all of the pipelines, middlewares, extensions, item loaders, input and output processors - great API interface in the non-trivial and often messy web-scraping field. Glad to help and thanks for an interesting question! – alecxe Sep 20 '15 at 01:30
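For reference, a rough sketch of what the approach described in the first comment might look like - subclassing the built-in retry middleware so that RETRY_TIMES handling and priority adjustment are reused. The class name here is made up, and _retry is a private helper of the built-in middleware, so its signature may differ between Scrapy versions:

from scrapy.downloadermiddlewares.retry import RetryMiddleware


class IncompletePageRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        # Trigger a retry on the placeholder marker instead of HTTP error codes
        if 'var PageIsLoaded = false;' in response.body:
            reason = 'incomplete rendering of {}'.format(response.url)
            # self._retry() returns a new request, or None once the retry limit is reached
            return self._retry(request, reason, spider) or response
        return response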