Occasionally I get a response with unexpected HTML, so the item fields cannot all be extracted; however, if I retry the request, it typically returns the expected HTML.
As a quick fix, I'm catching the error in the spider's parse method:
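# project/spiders/sample_spider.py
import logging

from scrapy import Request, Spider

from project.items import SampleItem  # assumed module path for SampleItem

logger = logging.getLogger(__name__)


class SampleSpider(Spider):
    [...]

    def parse(self, response):
        try:
            item = SampleItem()
            item['sample_1'] = response.xpath('sample').extract()
            # Indexing [0] raises IndexError when the XPath matches nothing.
            item['product_count_2'] = response.xpath('sample').extract()[0]
            yield item
        except IndexError:
            logger.debug('Retrying %(url)s', {'url': response.url})
            # dont_filter=True lets the same URL pass the duplicate filter.
            yield Request(response.url, self.parse, dont_filter=True)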
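Note that the retry above has no upper bound, so a page that is permanently malformed would be re-requested indefinitely. One possible way to cap it is to carry a counter in the request's `meta` dict. The sketch below is only an illustration: the helper `bump_retry_count`, the `parse_retries` key, and the default limit of 3 are hypothetical names and values, not part of Scrapy.

```python
def bump_retry_count(meta, max_retries=3, key="parse_retries"):
    """Return a copy of ``meta`` with the retry counter incremented.

    Returns None once ``max_retries`` attempts have already been made,
    signalling that the caller should give up instead of retrying.
    ``key`` and ``max_retries`` are hypothetical, illustrative defaults.
    """
    attempts = meta.get(key, 0)
    if attempts >= max_retries:
        return None
    updated = dict(meta)  # copy so the original meta is left untouched
    updated[key] = attempts + 1
    return updated


# Simulate three successive retries followed by giving up.
meta = {}
history = []
while meta is not None:
    history.append(meta.get("parse_retries", 0))
    meta = bump_retry_count(meta)
```

In the `except IndexError` branch this could be used as `meta = bump_retry_count(response.meta)`, yielding `Request(response.url, self.parse, dont_filter=True, meta=meta)` only when `meta` is not `None` and logging the failure otherwise.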
I came across this post, which appears to describe a similar scenario, but it seems this type of error should be handled in an Item Pipeline. Any thoughts on the best way to implement this fix?