
While crawling, some pages return a 200 response with only a partial HTML body: when I compare the response body with the page opened in a browser, the former is missing content. How can I catch these unexpected partial responses, either in the spider or in a downloader middleware?

Here is an example from the log:

2014-01-23 16:31:53+0100 [filmweb_multi] DEBUG: Crawled (408) http://www.filmweb.pl/film/Labirynt-2013-507169/photos> (referer: http://www.filmweb.pl/film/Labirynt-2013-507169) ['partial']

Girish
dabing1205

2 Answers


It's not partial content as such: the rest of the content is loaded dynamically by a JavaScript AJAX call.

To debug what content is actually sent in response to a particular request, use Scrapy's open_in_browser() helper (from scrapy.utils.response).

There's another thread, How to extract dynamic content from websites that are using AJAX?, which describes a workaround; refer to it.
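That workaround generally amounts to finding the XHR call in the browser's network tab and requesting its endpoint directly. A minimal sketch, assuming a hypothetical endpoint path (the real URL must be read from the network tab); a real spider would also subclass scrapy.Spider:

```python
# Hypothetical sketch: request the AJAX endpoint directly instead of
# relying on the initial HTML. The endpoint path below is an assumption
# for illustration only.

class FilmwebPhotosSpider:
    # Hypothetical XHR path serving the dynamically loaded photos.
    AJAX_ENDPOINT = "/ajax/film/Labirynt-2013-507169/photos"

    def parse(self, response):
        # The initial HTML is "complete" as far as the server is concerned;
        # the missing content arrives via a later XHR, so fetch it directly
        # with Scrapy's response.follow().
        yield response.follow(self.AJAX_ENDPOINT, callback=self.parse_photos)

    def parse_photos(self, response):
        # Extract the photo data from the AJAX response here.
        pass
```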

Girish

Seeing ['partial'] in the log means that the response is missing certain headers; see my answer here for more detail on what causes the partial flag.

To catch these responses, you can simply check the response flags. For example, if you created the request using Request(url=url, callback=self.parse), you would do the following in the callback:

def parse(self, response):
    if 'partial' in response.flags:
        # Do something with the response
        pass
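Since the question also asks about doing this in a downloader middleware, here is a minimal sketch of the same check at that layer. The class name and the retry-on-partial behaviour are illustrative assumptions, not from the answer above; the middleware would be enabled via the DOWNLOADER_MIDDLEWARES setting:

```python
# Sketch of a downloader middleware that catches responses Scrapy
# flagged as 'partial' and re-issues the request.

class PartialResponseMiddleware:
    """Retry responses whose flags include 'partial'."""

    def process_response(self, request, response, spider):
        if 'partial' in response.flags:
            spider.logger.warning("Partial body for %s, retrying", response.url)
            # Re-issue the request; dont_filter bypasses the duplicate filter
            # so the retry is not dropped as already seen.
            return request.replace(dont_filter=True)
        return response
```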
Daniel W