
I want to crawl a web page that shows the results of a search in Google's Web Store; the link is static for that particular keyword.

I want to find the ranking of an extension periodically. Here is the URL

The problem is that I can't render the dynamic data generated by the JavaScript code in the server's response.

I tried using Scrapy and scrapy-splash to render the desired page, but I was still getting the same response. I used Docker to run an instance of the scrapinghub/splash container on port 8050. I even visited http://localhost:8050 and submitted my URL manually, but it couldn't render the data, although the message reported success.

Here's the code I wrote for the crawler. It actually does nothing: its only job is to fetch the HTML contents of the desired page.

import scrapy
from scrapy_splash import SplashRequest

class WebstoreSpider(scrapy.Spider):
    name = 'webstore'

    def start_requests(self):
        yield SplashRequest(
            url='https://chrome.google.com/webstore/search/netflix%20vpn?utm_source=chrome-ntp-icon&_category=extensions',
            callback=self.parse,
            args={
                "wait": 3,
            },
        )

    def parse(self, response):
        print(response.text)

and the contents of the settings.py of my Scrapy project:

BOT_NAME = 'webstore_cralwer'

SPIDER_MODULES = ['webstore_cralwer.spiders']
NEWSPIDER_MODULE = 'webstore_cralwer.spiders'

ROBOTSTXT_OBEY = False

SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

And as the result I always get nothing.

Any help is appreciated.

  • Is it possible that you are being detected as a bot and given an error response? Have you considered scraping the content without Splash? (see https://stackoverflow.com/questions/8550114/can-scrapy-be-used-to-scrape-dynamic-content-from-websites-that-are-using-ajax and https://docs.scrapy.org/en/master/topics/dynamic-content.html) – Gallaecio Jul 08 '19 at 08:43
  • Thanks for answering! I'm not sure if I'm detected as a bot because when I tried to render the URL manually in `localhost:8050`, success message was shown. I tried scraping the URL without Splash but that didn't help either and I was still getting the gibberish content. I even tried apify.com service but that couldn't render the page either. @Gallaecio – Mohi_k Jul 08 '19 at 12:25

1 Answer


Works for me with a small custom Lua script:

lua_source = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(5.0))
    return {
        html = splash:html(),
    }
end
"""

You can then change your start_requests as follows. Note the endpoint='execute': without it, SplashRequest uses the default render.html endpoint, which does not run a custom Lua script.

def start_requests(self):
    yield SplashRequest(
        url='https://chrome.google.com/webstore/search/netflix%20vpn?utm_source=chrome-ntp-icon&_category=extensions',
        callback=self.parse,
        endpoint='execute',
        args={'lua_source': self.lua_source},
    )
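Once the Lua script returns the rendered HTML, extracting the extension's rank could look like the stdlib-only sketch below. The assumption that each result title sits in an `<h2>` element is mine; the real Web Store markup may differ, so adjust the tag check to match what you actually see in the rendered page.

```python
# Sketch of a rank-extraction helper. Assumption: each search result's
# title appears in an <h2> element; the real markup may use a different
# tag or attribute, so inspect the rendered HTML and adjust.
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the text of every <h2> element, in document order."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


def extension_rank(html, name):
    """Return the 1-based position of the first title containing `name`
    (case-insensitive), or None if it is not on the page."""
    parser = TitleCollector()
    parser.feed(html)
    for position, title in enumerate(parser.titles, start=1):
        if name.lower() in title.lower():
            return position
    return None
```

You could call extension_rank(response.text, 'some extension name') from the parse callback and log or store the result on each periodic run.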
Wim Hermans