I want to load a local HTML file using Scrapy Splash, save it as a PNG/JPEG, and then delete the HTML file.

import requests

script = """
splash:go(args.url)
return splash:png()
"""
resp = requests.post('http://localhost:8050/run', json={
    'lua_source': script,
    'url': 'file://my_file.html'
})
resp.content

It returns

Failed loading page (Protocol "" is unknown) Network error #301

I have also tried

yield SplashRequest(url=filepath, 
                    callback=self.parse_result,
                    meta={'filepath': filepath},
                    args={
                        'wait': 0.5,
                        'png': 1,
                    },
                    endpoint='render.html',
                )

But I get

2020-04-23 12:07:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying http://localhost:8050/render.html> (failed 1 times): 502 Bad Gateway

Umair Ayub

2 Answers

You’re using Scrapy Splash to communicate with ScrapingHub’s Splash service, which generates the image. It only supports HTTP(S) requests. To load local files directly, you would have to clone their repository and implement the change yourself.

It might be easier, though, to serve the HTML through a web server; you can use localhost. However, if you’re running Splash in Docker, you’ll need to allow access to the ports, since localhost inside the container refers to the container itself rather than to your machine.
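The web-server approach could be sketched roughly as follows. This is a minimal sketch under several assumptions, not a tested recipe for your setup: Splash is assumed to listen on http://localhost:8050 as in the question, the helper names `serve_directory` and `render_and_delete` are made up for illustration, and `page_host` must be a hostname or IP under which the Splash process can reach your machine (for a Docker container that is typically not `localhost`). It uses only the standard library and Splash's `render.png` endpoint.

```python
import functools
import http.server
import threading
import urllib.parse
import urllib.request
from pathlib import Path

# Assumption: Splash is reachable here, as in the question.
SPLASH_RENDER_PNG = "http://localhost:8050/render.png"


def serve_directory(directory, host="0.0.0.0", port=0):
    """Serve `directory` over HTTP in a background thread.

    port=0 lets the OS pick a free port; read it from server.server_address.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def render_and_delete(html_path, png_path, page_host):
    """Render a local HTML file to PNG via Splash, then delete the HTML file.

    page_host is the name under which Splash can reach this machine
    (hypothetical parameter; the right value depends on your network setup).
    """
    html_path = Path(html_path)
    server = serve_directory(html_path.parent)
    try:
        port = server.server_address[1]
        page_url = f"http://{page_host}:{port}/{urllib.parse.quote(html_path.name)}"
        query = urllib.parse.urlencode({"url": page_url, "wait": 0.5})
        with urllib.request.urlopen(f"{SPLASH_RENDER_PNG}?{query}", timeout=60) as resp:
            Path(png_path).write_bytes(resp.read())
    finally:
        server.shutdown()
        server.server_close()
    # Reached only if rendering succeeded, so a failed render keeps the HTML.
    html_path.unlink()
```

A call might look like `render_and_delete("my_file.html", "my_file.png", page_host="172.17.0.1")`, where the address is only an example of a Docker bridge gateway and will differ between setups. Binding to `0.0.0.0` exposes the directory to the whole network while the server runs, so a narrower bind address may be preferable.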

Greg
  • Actually, "then you could serve the HTML from a web server (as the code should be able to scrape localhost)." is wrong: even serving from localhost doesn't work – goku Jun 25 '20 at 03:41

The two links below recommend against using localhost. Some people there mentioned that turning off Crawlera fixed their problem. Crawlera could be trying to route your requests through external IPs to reach your localhost, which would be problematic.

Scrapy Splash on Ubuntu server: got an unexpected keyword argument 'encoding'

https://github.com/scrapy-plugins/scrapy-splash/issues/108

  • By Crawlera, do you mean the proxy provider? – goku Jun 26 '20 at 05:05
  • Why would I use a paid proxy provider while scraping local HTML? More importantly, if the Crawlera middleware is active, you can't open local HTML with it. Your "answer" is wrong on many levels. – goku Jun 26 '20 at 20:57