
While crawling, some pages return a 200 response with only a partial HTML body: when I compare the response body with the page opened in a browser, the former is missing content. How can I catch these unexpected partial responses, either in the spider or in a downloader middleware?

Here is an example from the log:

2014-01-23 16:31:53+0100 [filmweb_multi] DEBUG: Crawled (408) http://www.filmweb.pl/film/Labirynt-2013-507169/photos> (referer: http://www.filmweb.pl/film/Labirynt-2013-507169) ['partial']

Girish
dabing1205

2 Answers


It's not partial content as such: the rest of the content is loaded dynamically by a JavaScript AJAX call.

To debug what content is actually sent in response to a particular request, use Scrapy's open_in_browser() helper (from scrapy.utils.response).

There's another thread, How to extract dynamic content from websites that are using AJAX?, which describes a workaround; refer to it.
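That workaround generally amounts to finding the XHR call in the browser's network tab and requesting its endpoint directly. A minimal sketch, assuming a hypothetical endpoint path (the real URL must be read from the network tab); a real spider would also subclass scrapy.Spider:

```python
# Hypothetical sketch: request the AJAX endpoint directly instead of
# relying on the initial HTML. The endpoint path below is an assumption
# for illustration only.

class FilmwebPhotosSpider:
    # Hypothetical XHR path serving the dynamically loaded photos.
    AJAX_ENDPOINT = "/ajax/film/Labirynt-2013-507169/photos"

    def parse(self, response):
        # The initial HTML is "complete" as far as the server is concerned;
        # the missing content arrives via a later XHR, so fetch it directly
        # with Scrapy's response.follow().
        yield response.follow(self.AJAX_ENDPOINT, callback=self.parse_photos)

    def parse_photos(self, response):
        # Extract the photo data from the AJAX response here.
        pass
```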

Girish

Seeing ['partial'] in the log means that the response is missing certain headers; see my answer here for more detail on what causes the partial flag.

To catch these responses, you can simply check the response flags. For example, if you created the request using Request(url=url, callback=self.parse), you would do the following in the callback:

def parse(self, response):
    if 'partial' in response.flags:
        # Do something with the response
        pass
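Since the question also asks about doing this in a downloader middleware, here is a minimal sketch of the same check at that layer. The class name and the retry-on-partial behaviour are illustrative assumptions, not from the answer above; the middleware would be enabled via the DOWNLOADER_MIDDLEWARES setting:

```python
# Sketch of a downloader middleware that catches responses Scrapy
# flagged as 'partial' and re-issues the request.

class PartialResponseMiddleware:
    """Retry responses whose flags include 'partial'."""

    def process_response(self, request, response, spider):
        if 'partial' in response.flags:
            spider.logger.warning("Partial body for %s, retrying", response.url)
            # Re-issue the request; dont_filter bypasses the duplicate filter
            # so the retry is not dropped as already seen.
            return request.replace(dont_filter=True)
        return response
```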
Daniel W