
I'm trying to scrape an e-commerce website. Example link: https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1

The data is rendered via React. When I scrape a few links, most of the data comes back as null, and when I view the page source I cannot find the HTML that is visible via inspect element, just JSON inside JavaScript tags. I ran the Scrapy scraper several times on the same links, and data that was missing before was sometimes returned, so the failures seem random. I cannot figure out how I should scrape this kind of website. I'm also using a pool of user agents and pauses between requests.

    script = '''
        function main(splash, args)
            assert(splash:go(args.url))
            assert(splash:wait(1.5))
            return splash:html()
        end
    '''

    def start_requests(self):
        url= [
            'https://www.lazada.sg/products/esogoal-tactical-sling-bag-outdoor-chest-pack-shoulder-backpack-military-sport-bag-for-trekking-camping-hiking-rover-sling-daypack-for-men-women-i204814494-s353896924.html?mp=1',
            'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1',
            'https://www.lazada.sg/products/esogoal-selfie-stick-tripod-extendable-selfie-stick-monopod-with-integrated-tripod-and-bluetooth-remote-shutter-wireless-selfie-stick-tripod-for-cellphonecameras-i205279097-s309050125.html?mp=1',
            'https://www.lazada.sg/products/esogoal-mini-umbrella-travel-umbrella-sun-rain-umbrella8-ribs-98cm-big-surface-lightweight-compact-parasol-uv-protection-for-men-women-i204815487-s308312226.html?mp=1',
            'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1'
        ]

        for link in url:
            yield SplashRequest(url=link, callback=self.parse, endpoint='render.html', args={'wait' : 0.5, 'lua_source' : self.script}, dont_filter=True)

    def parse(self, response):
        yield {
            'title' : response.xpath("//span[@class='pdp-mod-product-badge-title']/text()").extract_first(),
            'price' : response.xpath("//span[contains(@class, 'pdp-price')]/text()").extract_first(),
            'description' : response.xpath("//div[@id='module_product_detail']").extract_first()
        }
Andrew
  • Instead of using Splash, have you considered going the other way and fetching the underlying data the same way their React code does? See https://stackoverflow.com/q/8550114/939364 and https://docs.scrapy.org/en/master/topics/dynamic-content.html – Gallaecio Jul 08 '19 at 08:38
  • Maybe they're blocking the requests based on the IP. Have you considered using proxies? For me it seems to work when I use a Splash instance that runs locally, but the requests from a Splash instance that runs in Google Cloud don't work correctly. – Wim Hermans Jul 08 '19 at 14:59
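
The first comment's suggestion, reading the JSON that is already inside the script tags instead of rendering the page, could look roughly like the sketch below. It is only an illustration under stated assumptions: the variable name `__DATA__`, the sample HTML, and the JSON layout are hypothetical, so the real page source has to be inspected to find the actual variable name and keys.

```python
import json
import re

# Hypothetical page source: the variable name `__DATA__` and the JSON
# layout are assumptions for illustration, not the real site's markup.
SAMPLE_HTML = """
<html><body>
<script>
  var __DATA__ = {"product": {"title": "Selfie stick", "price": "12.90"}};
</script>
</body></html>
"""


def extract_embedded_json(html, var_name):
    """Return the JSON value assigned to `var_name` in an inline script, or None."""
    # Find "<var_name> = " and remember where the assigned value starts.
    match = re.search(r"\b" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (the trailing ";" and the rest of the page).
        obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return obj


data = extract_embedded_json(SAMPLE_HTML, "__DATA__")
print(data["product"]["title"], data["product"]["price"])
```

In a spider, `response.text` would take the place of `SAMPLE_HTML`, and a `None` result would signal that the page came back without the expected data.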

1 Answer


I tried this:

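  • Pass 'execute' as the endpoint argument of SplashRequest instead of 'render.html'. The render.html endpoint does not run lua_source, so the Lua script only takes effect when the request goes to the execute endpoint.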

    import scrapy
    from scrapy_splash import SplashRequest


    class DynamicSpider(scrapy.Spider):
        name = 'products'
        url = [
            'https://www.lazada.sg/products/esogoal-tactical-sling-bag-outdoor-chest-pack-shoulder-backpack-military-sport-bag-for-trekking-camping-hiking-rover-sling-daypack-for-men-women-i204814494-s353896924.html?mp=1',
            'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1',
            'https://www.lazada.sg/products/esogoal-selfie-stick-tripod-extendable-selfie-stick-monopod-with-integrated-tripod-and-bluetooth-remote-shutter-wireless-selfie-stick-tripod-for-cellphonecameras-i205279097-s309050125.html?mp=1',
            'https://www.lazada.sg/products/esogoal-mini-umbrella-travel-umbrella-sun-rain-umbrella8-ribs-98cm-big-surface-lightweight-compact-parasol-uv-protection-for-men-women-i204815487-s308312226.html?mp=1',
            'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1',
        ]

        # Lua script run by Splash: load the page, wait for the JavaScript
        # to render, then return the rendered HTML under the 'html' key.
        script = """
            function main(splash, args)
              assert(splash:go(args.url))
              assert(splash:wait(1.5))
              return {
                html = splash:html()
              }
            end
        """

        def start_requests(self):
            for link in self.url:
                yield SplashRequest(
                    url=link,
                    callback=self.parse,
                    endpoint='execute',  # 'execute' is the endpoint that runs lua_source
                    args={'wait': 0.5, 'lua_source': self.script},
                    dont_filter=True,
                )

        def parse(self, response):
            # The selectors run against the HTML returned by the Lua script.
            yield {
                'title': response.xpath("//span[@class='pdp-mod-product-badge-title']/text()").extract_first(),
                'price': response.xpath("//span[contains(@class, 'pdp-price')]/text()").extract_first(),
                'description': response.xpath("//div[@id='module_product_detail']/h2/text()").extract_first()
            }
    

And this is the result: [image]

GmrYael