27

We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a Docker container.

If we want to use Splash in the spider, we configure several required project settings and yield a Request with the appropriate meta arguments:

import scrapy_splash
from scrapy import Request

# inside the spider's callback:
yield Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,

            # 'url' is prefilled from request url
        },

        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
    }
})
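
For reference, the required project settings are the ones from the scrapy-splash README; ours look roughly like this (a sketch, with the Splash URL being wherever our container is exposed):

# settings.py -- as described in the scrapy-splash README
SPLASH_URL = 'http://localhost:8050'   # wherever the Splash container listens

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'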

This works as documented. But how can we use scrapy-splash inside the Scrapy shell?

Gallaecio
alecxe
  • It's true there's no `DEFAULT_REQUEST_META` like there is a [DEFAULT_REQUEST_HEADERS](http://doc.scrapy.org/en/latest/topics/settings.html?#std:setting-DEFAULT_REQUEST_HEADERS), which would be a nice addition. There are open discussions on enabling Splash by default via a middleware (see https://github.com/scrapinghub/scrapy-splash/issues/11). Another option is to subclass the scrapy-splash middleware and force settings there. Ideas welcome on https://github.com/scrapinghub/scrapy-splash/issues – paul trmbrth Feb 12 '16 at 12:57

3 Answers

43

Just wrap the URL you want to open in the shell in the Splash HTTP API.

So you would want something like:

scrapy shell 'http://localhost:8050/render.html?url=http://example.com/page-with-javascript.html&timeout=10&wait=0.5'

where:

  • localhost:8050 is where your Splash service is running
  • url is the URL you want to crawl; don't forget to urlquote it (see the small sketch after this list)!
  • render.html is one of the possible HTTP API endpoints; in this case it returns the rendered HTML page
  • timeout is the timeout in seconds for the render
  • wait is the time in seconds to wait for JavaScript to execute before reading/saving the HTML.
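
If the target URL itself contains query parameters, quoting matters. A small sketch (not part of the original answer; the URL is a placeholder) that builds the shell command with the target URL properly quoted:

from urllib.parse import quote

target = 'http://example.com/page-with-javascript.html?foo=bar'
render = 'http://localhost:8050/render.html?url={}&timeout=10&wait=0.5'.format(quote(target, safe=''))
print("scrapy shell '{}'".format(render))
# scrapy shell 'http://localhost:8050/render.html?url=http%3A%2F%2Fexample.com%2Fpage-with-javascript.html%3Ffoo%3Dbar&timeout=10&wait=0.5'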
Stephen Ostermiller
Granitosaurus
  • you can probably make a bash alias to make this more convenient. – Granitosaurus Feb 12 '16 at 10:01
  • @StephenOstermiller you just uppercased some words and ruined the formatting. – Granitosaurus Jul 11 '22 at 10:53
  • Something is funky with the markdown formatting, I've never seen trailing white space introduce new lines in the output. Using list formatting will preserve the new lines. I also use `example.com` instead of a non-example `.com`, which is the main reason for the edit. – Stephen Ostermiller Jul 11 '22 at 10:58
19

You can run scrapy shell without arguments inside a configured Scrapy project, then create req = scrapy_splash.SplashRequest(url, ...) and call fetch(req).
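
A minimal shell session might look like this (a sketch assuming the project already has the scrapy-splash settings configured; the URL and selector are placeholders):

$ scrapy shell
>>> from scrapy_splash import SplashRequest
>>> req = SplashRequest('http://example.com/page-with-javascript.html',
...                     args={'wait': 0.5})
>>> fetch(req)
>>> response.css('title::text').get()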

Mikhail Korobov
0

For Windows users who use Docker Toolbox:

  1. Replace the single quotes with double quotes to prevent the invalid hostname:http error.

  2. Change localhost to the Docker machine's IP address shown below the whale logo. For me it was 192.168.99.100.

Finally I got this:

scrapy shell "http://192.168.99.100:8050/render.html?url="https://example.com/category/banking-insurance-financial-services/""

Stephen Ostermiller
Uchiha AJ