
I am crawling a site that uses lazy loading for its product images.

For this reason I included scrapy-splash, so that the JavaScript can be rendered and so that I can pass Splash a wait argument. My initial thought was that it was a timing issue: a raw scrapy.Request returns the placeholder image instead of the originals because the page is fetched before the real images load.

I've tried setting the wait argument as high as 29.0 seconds, but my crawler still barely gets 10 items (it should bring back 280 items, based on my calculations). I have an item pipeline that raises DropItem when the image field of an item is empty.

I'm not sure, but I've also noticed that it's not just a wait problem. It looks like the images only get loaded when I scroll down.

What I am looking for is a way to automate scroll-to-bottom behaviour within my requests.

Here is my code.

Spider:

# Requires: import urlparse (Python 2) and
# from scrapy_splash import SplashRequest
def parse(self, response):
    categories = response.css('div.navigation-top-links a.uppercase::attr(href)').extract()
    for category in categories:
        link = urlparse.urljoin(self.start_urls[0], category)
        yield SplashRequest(link, callback=self.parse_products_listing,
                            endpoint='render.html', args={'wait': 0.5})

Pipeline:

from scrapy.exceptions import DropItem


class ScraperPipeline(object):
    def process_item(self, item, spider):
        # Drop products whose images never loaded
        if not item['images']:
            raise DropItem("Missing images in item: %s" % item)
        return item

Settings:

IMAGES_STORE = '/scraper/images'
        
SPLASH_URL = 'http://172.22.0.2:8050'

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'


ITEM_PIPELINES = {
   'scraper.pipelines.ScraperPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1
}


DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # 'custom_middlewares.middleware.ProxyMiddleware': 210,
}
Raheel

1 Answer


If you are set on using Splash, this answer should give you some guidance: https://stackoverflow.com/a/40366442/7926936
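As a rough sketch of the approach from that link (using Splash's execute endpoint with the scrolling pattern from the Splash FAQ; the scroll count and wait time below are assumptions you will need to tune for your site), the request would look something like this:

from scrapy_splash import SplashRequest

# Lua script run inside Splash: load the page, then scroll to the bottom
# several times, pausing between scrolls so lazy-loaded images can render.
SCROLL_SCRIPT = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(args.wait)
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() { return document.body.scrollHeight; }")
    for _ = 1, args.scrolls do
        scroll_to(0, get_body_height())
        splash:wait(args.wait)
    end
    return splash:html()
end
"""

def parse(self, response):
    categories = response.css('div.navigation-top-links a.uppercase::attr(href)').extract()
    for category in categories:
        link = urlparse.urljoin(self.start_urls[0], category)
        yield SplashRequest(link, callback=self.parse_products_listing,
                            endpoint='execute',
                            args={'lua_source': SCROLL_SCRIPT, 'wait': 1.0, 'scrolls': 10})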

You could also use Selenium in a DownloaderMiddleware. Here is an example I have from a Twitter scraper that gets the first 200 tweets of a page:

from selenium import webdriver
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait


class SeleniumMiddleware(object):

    def __init__(self):
        self.driver = webdriver.PhantomJS()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        tweets = self.driver.find_elements_by_xpath("//li[@data-item-type='tweet']")
        while len(tweets) < 200:
            try:
                # Scroll to the bottom of the page to trigger lazy loading
                self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                # Wait (up to 10 seconds) until more tweets than before are present
                WebDriverWait(self.driver, 10).until(
                    lambda driver: new_posts(driver, len(tweets)))
                tweets = self.driver.find_elements_by_xpath("//li[@data-item-type='tweet']")
            except TimeoutException:
                # No new tweets appeared within 10 seconds; assume we hit the end
                break
        # Hand the fully rendered page back to Scrapy as an HtmlResponse
        body = self.driver.page_source
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)


def new_posts(driver, min_len):
    # True once the page contains more tweets than it did before scrolling
    return len(driver.find_elements_by_xpath("//li[@data-item-type='tweet']")) > min_len

In the while loop, each iteration scrolls down and then waits (with a 10-second maximum) for new tweets to load, until there are 200 tweets on the page.
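To wire this in, you would enable the middleware in your settings. The module path below is hypothetical; adjust it to wherever the SeleniumMiddleware class actually lives in your project:

DOWNLOADER_MIDDLEWARES = {
    # Hypothetical path; point this at your own middleware module.
    'custom_middlewares.middleware.SeleniumMiddleware': 200,
}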

Henrique Coura