Questions tagged [splash-js-render]

Splash is a JavaScript rendering service: a lightweight web browser with an HTTP API, implemented in Python using Twisted and Qt. It is an alternative to browser-automation tools such as Selenium.

https://splash.readthedocs.io/en/stable/

Splash - A JavaScript rendering service

Splash is a JavaScript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python using Twisted and Qt. The (Twisted) Qt reactor is used to make the server fully asynchronous, allowing it to take advantage of WebKit concurrency via the Qt main loop. Some of Splash's features:

  • process multiple webpages in parallel;
  • get HTML results and/or take screenshots;
  • turn OFF images or use Adblock Plus rules to make rendering faster;
  • execute custom JavaScript in page context;
  • write Lua browsing scripts;
  • develop Splash Lua scripts in Splash-Jupyter Notebooks;
  • get detailed rendering info in HAR format.
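
Since Splash exposes its browser through a plain HTTP API, rendering a page is a single GET request. A minimal sketch, assuming a Splash instance listening on the default `localhost:8050` (the port published by the `scrapinghub/splash` Docker image); the target URL is a placeholder:

```python
from urllib.parse import urlencode

# Splash's render.html endpoint returns the page's HTML after JavaScript
# has executed.  Assumes Splash runs locally on the default port 8050.
SPLASH = "http://localhost:8050"

def render_url(page_url, wait=0.5):
    """Build a render.html request URL for the given page."""
    params = urlencode({"url": page_url, "wait": wait})
    return f"{SPLASH}/render.html?{params}"

url = render_url("http://example.com")
# urllib.request.urlopen(url).read() would then return the rendered HTML.
print(url)
```

Other endpoints follow the same pattern: `render.png` for screenshots, `render.har` for HAR output, and `execute` for Lua scripts.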
138 questions
27 votes · 3 answers

Scrapy Shell and Scrapy Splash

We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a Docker container. If we want to use Splash in the spider, we configure several required project settings and yield a…
alecxe
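
The "required project settings" the excerpt refers to are the ones documented in the scrapy-splash README. A sketch of the corresponding `settings.py` fragment; the `SPLASH_URL` value assumes a local Docker container started with `docker run -p 8050:8050 scrapinghub/splash`:

```python
# scrapy-splash project settings, per the project's README.
SPLASH_URL = "http://localhost:8050"  # assumes a local Splash container

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# In the spider, requests are then yielded as:
#   yield SplashRequest(url, self.parse, args={"wait": 0.5})
```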
16 votes · 3 answers

Adding a wait-for-element while performing a SplashRequest in python Scrapy

I am trying to scrape a few dynamic websites using Splash for Scrapy in Python. However, I see that Splash fails to wait for the complete page to load in certain cases. A brute force way to tackle this problem was to add a large wait time (e.g. 5…
NightFury13
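
Instead of a fixed wait, a Lua script run through the `execute` endpoint can poll for the element that signals the page is ready. A sketch of the common polling pattern, using Splash's `splash:select` and `splash:wait`; the `#content` selector and the 20 × 0.5 s budget are placeholders:

```python
# Lua script for SplashRequest(url, callback, endpoint="execute",
# args={"lua_source": WAIT_FOR_ELEMENT}).  Polls for a CSS selector
# instead of sleeping a fixed time.  "#content" is a placeholder.
WAIT_FOR_ELEMENT = """
function main(splash, args)
  splash:go(args.url)
  -- poll until the element exists or ~10 s have passed
  for _ = 1, 20 do
    if splash:select("#content") then break end
    splash:wait(0.5)
  end
  return {html = splash:html()}
end
"""
```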
10 votes · 1 answer

How to set splash timeout in scrapy-splash?

I use scrapy-splash to crawl web pages and run the Splash service on Docker. Command: docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600 But I got a 504 error. "error": {"info": {"timeout": 30}, "description": "Timeout exceeded rendering…
Jhon Smith
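
The 504 in the excerpt comes from the per-request timeout, which defaults to 30 seconds regardless of the container's `--max-timeout` flag; the flag is only the upper bound a request may ask for. A sketch of raising it per request (the values are illustrative):

```python
# Per-request Splash arguments.  The "timeout" argument must stay at or
# below the value passed to the container via --max-timeout (3600 above);
# otherwise Splash rejects the request.
splash_args = {
    "timeout": 90,   # seconds Splash may spend on this single render
    "wait": 5,       # seconds to wait after the page loads
}
# yield SplashRequest(url, self.parse, args=splash_args)
assert splash_args["timeout"] <= 3600
```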
9 votes · 3 answers

Scrapy CrawlSpider + Splash: how to follow links through linkextractor?

I have the following code that is partially working: class ThreadSpider(CrawlSpider): name = 'thread' allowed_domains = ['bbs.example.com'] start_urls = ['http://bbs.example.com/diy'] rules = ( Rule(LinkExtractor( …
eN_Joy
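
The underlying problem is that `CrawlSpider`'s `LinkExtractor` yields plain Requests that bypass Splash. A common workaround is a `Rule(..., process_request=...)` hook that attaches the scrapy-splash `meta` to each extracted request. A sketch of the hook's logic, shown on a plain dict so it stays self-contained; in a real spider the argument is a `scrapy.Request`:

```python
def use_splash(request):
    """Attach scrapy-splash rendering info to a (dict-like) request's meta."""
    request["meta"] = {
        "splash": {
            "endpoint": "render.html",
            "args": {"wait": 0.5},
        }
    }
    return request

# In a real spider:  Rule(LinkExtractor(...), process_request=use_splash)
req = use_splash({"url": "http://bbs.example.com/diy"})
```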
8 votes · 3 answers

how does scrapy-splash handle infinite scrolling?

I want to reverse-engineer the contents generated by scrolling down in the webpage. The problem is in the url https://www.crowdfunder.com/user/following_page/80159?user_id=80159&limit=0&per_page=20&screwrand=933. screwrand doesn't seem to follow…
Bowen Liu
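
When the scroll endpoint can't be reverse-engineered, Splash itself can drive the scrolling from Lua so the lazy-loaded content ends up in the rendered HTML. A sketch for the `execute` endpoint; the iteration count and waits are arbitrary placeholders:

```python
# Lua script that scrolls to the bottom of the page a few times so
# infinite-scroll content is loaded before the HTML is returned.
SCROLL_SCRIPT = """
function main(splash, args)
  splash:go(args.url)
  splash:wait(1)
  for _ = 1, 5 do
    splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
    splash:wait(1)
  end
  return splash:html()
end
"""
# yield SplashRequest(url, self.parse, endpoint="execute",
#                     args={"lua_source": SCROLL_SCRIPT})
```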
7 votes · 0 answers

Splash containers stop working after 30 minutes

I have an issue with Aquarium and Splash: they stop working 30 minutes after the start. The number of pages to load is 50K-80K. I made a cron job to automatically reboot each Splash container every 10 minutes, but it didn't work. How…
amarynets
7 votes · 2 answers

Using docker, scrapy splash on Heroku

I have a Scrapy spider that uses Splash, which runs on Docker at localhost:8050, to render JavaScript before scraping. I am trying to run this on Heroku but have no idea how to configure Heroku to start Docker to run Splash before running my web: scrapy…
HearthQiu
7 votes · 1 answer

scrapy-splash returns its own headers and not the original headers from the site

I use scrapy-splash to build my spider. Now what I need is to maintain the session, so I use the scrapy.downloadermiddlewares.cookies.CookiesMiddleware and it handles the set-cookie header. I know it handles the set-cookie header because I set…
Roman Smelyansky
7 votes · 2 answers

How to install python-gtk2, python-webkit and python-jswebkit on OSX

I've read through many of the related questions but am still unclear how to do this, as there are many software combinations available and many solutions seem outdated. What is the best way to install the following on my virtual environment on…
jyek
6 votes · 1 answer

How to send custom headers in a Scrapy Splash request?

My spider.py file is as so: def start_requests(self): for url in self.start_urls: yield scrapy.Request( url, self.parse, headers={'My-Custom-Header':'Custom-Header-Content'}, meta={ …
Nadun Perera
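
With SplashRequest there are two distinct header destinations, which is the usual source of confusion here: `headers=` is forwarded to the target site by Splash, while `splash_headers=` is sent to the Splash HTTP API itself. A sketch; the header name mirrors the excerpt above:

```python
# Headers for the page being rendered; the name mirrors the excerpt.
page_headers = {"My-Custom-Header": "Custom-Header-Content"}

# yield SplashRequest(
#     url,
#     self.parse,
#     headers=page_headers,   # sent to the target site by Splash
#     splash_headers={},      # sent to the Splash HTTP API itself
#     args={"wait": 0.5},
# )
```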
6 votes · 1 answer

Form Request Using Scrapy + Splash

I am trying to login to a website using the following code (slightly modified for this post): import scrapy from scrapy_splash import SplashRequest from scrapy.crawler import CrawlerProcess class Login_me(scrapy.Spider): name = 'espn' …
6 votes · 1 answer

scrapy, splash, lua, button click

I am new to all the instruments here. My goal is to extract all URLs from a lot of pages which are connected, more or less, by a "Weiter"/"next" button, and that for several URLs. I decided to try that with Scrapy. The page is dynamically generated. Then I…
P. Guyan
6 votes · 0 answers

Docker Scrapinghub/splash exited with 139

I'm using Scrapy to do some crawling with Splash using the scrapinghub/splash Docker container; however, the container exits after a while by itself with exit code 139. I'm running the scraper on an AWS EC2 instance with 1 GB swap assigned. I also tried…
MtziSam
6 votes · 1 answer

Splash lua script to do multiple clicks and visits

I'm trying to crawl Google Scholar search results and get all the BiBTeX format of each result matching the search. Right now I have a Scrapy crawler with Splash. I have a lua script which will click the "Cite" link and load up the modal window…
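
Clicking an element from a Lua script is typically done by combining `splash:select` with the element's `mouse_click` method, then waiting for the modal to render. A sketch; the `.gs_or_cit` selector is a placeholder guessed for illustration, not taken from the question:

```python
# Lua sketch: click a "Cite" link and return the page HTML including the
# opened modal.  The ".gs_or_cit" selector is a placeholder.
CITE_SCRIPT = """
function main(splash, args)
  splash:go(args.url)
  splash:wait(1)
  local link = splash:select(".gs_or_cit")
  if link then
    link:mouse_click()
    splash:wait(1)   -- give the modal time to render
  end
  return {html = splash:html()}
end
"""
```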
5 votes · 2 answers

Google App Engine: Load another Docker Image for Scrapy + Splash

I'd like to scrape a javascript website using Scrapy + Splash in Google App Engine. The Splash plugin is a Docker image. Is there any way to use this within Google App Engine? App Engine itself uses a Docker image, but I'm not sure how to load and…
bgolson