I am crawling a site which uses lazy loading for product images.
For this reason i included scrapy-splash
so that the javascript can be rendered also with splash i can provide a wait
argument. Previously i had a though that it is because of the timing that the raw scrapy.Request
is returning a placeholder image instead of the originals.
I've tried wait argument to 29.0 secs also, but still my crawler hardly getting 10 items (it should bring 280 items based on calculations). I have a item pipleline which checks if the image is empty in the item so i raise DropItem
.
I am not sure, but i also noticed that its not only the wait
problem. It looks like that images gets loaded when i scroll down.
What i am looking for is a way to automate a scroll to bottom behaviour within my requests.
Here is my code Spider
def parse(self, response):
categories = response.css('div.navigation-top-links a.uppercase::attr(href)').extract()
for category in categories:
link = urlparse.urljoin(self.start_urls[0], category)
yield SplashRequest(link, callback=self.parse_products_listing, endpoint='render.html',
args={'wait': 0.5})
Pipeline
class ScraperPipeline(object):
def process_item(self, item, spider):
if not item['images']:
raise DropItem
return item
Settings
IMAGES_STORE = '/scraper/images'
SPLASH_URL = 'http://172.22.0.2:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
ITEM_PIPELINES = {
'scraper.pipelines.ScraperPipeline': 300,
'scrapy.pipelines.images.ImagesPipeline': 1
}
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddleware.useragent.UserAgentMiddleware': None,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
# 'custom_middlewares.middleware.ProxyMiddleware': 210,
}