Questions tagged [scrapy-splash]

scrapy-splash is a Scrapy plugin that integrates the Scrapy framework with Splash, the JavaScript rendering service.

594 questions
27 votes, 3 answers

Scrapy Shell and Scrapy Splash

We've been using scrapy-splash middleware to pass the scraped HTML source through the Splash javascript engine running inside a docker container. If we want to use Splash in the spider, we configure several required project settings and yield a…
alecxe • 462,703
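The "several required project settings" this question refers to are documented in the scrapy-splash README. As a sketch, assuming Splash runs locally in Docker on its default port, the configuration and the request it enables look like this:

```python
# Typical scrapy-splash project settings (middleware names and priorities
# are from the scrapy-splash README). SPLASH_URL assumes a local Docker
# container on the default port.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# With those in place, the spider yields Splash-rendered requests roughly as:
#   yield SplashRequest(url, self.parse, args={'wait': 0.5})
```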
16 votes, 3 answers

Adding a wait-for-element while performing a SplashRequest in python Scrapy

I am trying to scrape a few dynamic websites using Splash for Scrapy in Python. However, I see that Splash fails to wait for the complete page to load in certain cases. A brute-force way to tackle this problem was to add a large wait time (e.g. 5…
NightFury13 • 761
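A common alternative to a fixed wait is a Lua script, run through Splash's execute endpoint, that polls for the element instead. A sketch, where the '.content-loaded' selector and the 10-second budget are illustrative assumptions:

```python
# Lua script for Splash's /execute endpoint: poll until a CSS selector
# matches (or a time budget runs out) instead of a fixed splash:wait().
# The '.content-loaded' selector and the 10-second budget are illustrative.
wait_for_element = """
function main(splash, args)
  assert(splash:go(args.url))
  local budget = 10.0
  while budget > 0 do
    if splash:select('.content-loaded') then
      break
    end
    splash:wait(0.2)
    budget = budget - 0.2
  end
  return splash:html()
end
"""

# The spider would send it roughly as:
#   yield SplashRequest(url, self.parse, endpoint='execute',
#                       args={'lua_source': wait_for_element})
```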
14 votes, 1 answer

Does using scrapy-splash significantly affect scraping speed?

So far, I have been using just scrapy and writing custom classes to deal with websites using ajax. But if I were to use scrapy-splash, which, from what I understand, scrapes the rendered html after javascript, will the speed of my crawler be affected…
hsy • 165
10 votes, 1 answer

Scrapy Splash Screenshots?

I'm trying to scrape a site whilst taking a screenshot of every page. So far, I have managed to piece together the following code: import json import base64 import scrapy from scrapy_splash import SplashRequest class ExtractSpider(scrapy.Spider): …
Exam Orph • 365
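With the execute endpoint, one Lua script can return both the page HTML and a PNG screenshot in a single response; scrapy-splash exposes the returned table as response.data, with the PNG base64-encoded. A sketch (the save_screenshot helper is hypothetical):

```python
import base64

# Lua script for Splash's /execute endpoint: return the rendered HTML and
# a PNG screenshot together.
screenshot_script = """
function main(splash, args)
  assert(splash:go(args.url))
  splash:wait(0.5)
  return {html = splash:html(), png = splash:png()}
end
"""

# scrapy-splash exposes the returned table as response.data; the PNG
# arrives base64-encoded, so a callback would write it out like this:
def save_screenshot(data, path):
    """data: the decoded table from Splash; path: output PNG file."""
    with open(path, 'wb') as f:
        f.write(base64.b64decode(data['png']))
```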
10 votes, 1 answer

How to set splash timeout in scrapy-splash?

I use scrapy-splash to crawl web pages, and run the Splash service on Docker. Command: docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600 But I got a 504 error. "error": {"info": {"timeout": 30}, "description": "Timeout exceeded rendering…
Jhon Smith • 181
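Raising --max-timeout on the Docker side only lifts the ceiling Splash will accept; each individual request still defaults to a 30-second budget, which is what the "timeout": 30 in the 504 error reports. The per-request timeout has to be passed explicitly. A sketch:

```python
# Per-request Splash arguments. The 'timeout' value must not exceed the
# --max-timeout Splash was started with (3600 in the question's command).
splash_args = {
    'wait': 1,
    'timeout': 3600,  # allowed here because Splash ran with --max-timeout 3600
}

# The spider then yields, roughly:
#   yield SplashRequest(url, self.parse, args=splash_args)
```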
9 votes, 3 answers

Scrapy CrawlSpider + Splash: how to follow links through linkextractor?

I have the following code that is partially working, class ThreadSpider(CrawlSpider): name = 'thread' allowed_domains = ['bbs.example.com'] start_urls = ['http://bbs.example.com/diy'] rules = ( Rule(LinkExtractor( …
eN_Joy • 853
8 votes, 2 answers

SplashRequest gives - TypeError: attrs() got an unexpected keyword argument 'eq'

I am using a cloud Splash instance from ScrapingHub. I am trying to do a simple request using the Scrapy-Splash library and I keep getting the error: @attr.s(hash=False, repr=False, eq=False) TypeError: attrs() got an unexpected keyword argument…
Ankur • 50,282
8 votes, 3 answers

how does scrapy-splash handle infinite scrolling?

I want to reverse-engineer the contents generated by scrolling down in the webpage. The problem is in the url https://www.crowdfunder.com/user/following_page/80159?user_id=80159&limit=0&per_page=20&screwrand=933. screwrand doesn't seem to follow…
Bowen Liu • 99
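Rather than reverse-engineering the scroll-triggered AJAX URLs, Splash can be asked to do the scrolling itself in a Lua script, so the page's own JavaScript fetches the extra content. The scroll count and waits below are illustrative:

```python
# Lua script: scroll to the bottom repeatedly so the page's own
# scroll-triggered AJAX runs inside Splash, then return the full HTML.
# The number of scrolls and the 1-second waits are illustrative.
scroll_script = """
function main(splash, args)
  assert(splash:go(args.url))
  splash:wait(1)
  for _ = 1, 10 do
    splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
    splash:wait(1)
  end
  return splash:html()
end
"""

# Sent from a spider roughly as:
#   yield SplashRequest(url, self.parse, endpoint='execute',
#                       args={'lua_source': scroll_script})
```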
7 votes, 2 answers

How to load local HTML file in Scrapy Splash?

I want to load a local HTML file using Scrapy Splash, save it as PNG/JPEG and then delete the HTML file script = """ splash:go(args.url) return splash:png() """ resp = requests.post('http://localhost:8050/run', json={ 'lua_source':…
Umair Ayub • 19,358
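Because Splash runs inside Docker, it cannot see files on the host, so splash:go() on a local path fails. One workaround is to read the file in Python and hand its contents to splash:set_content(). A standard-library-only sketch; the endpoint URL and file path are assumptions:

```python
import json
import urllib.request

# Lua script: render HTML passed in via args instead of fetching a URL,
# then return a PNG of the result.
render_local = """
function main(splash, args)
  splash:set_content(args.html)
  splash:wait(0.5)
  return splash:png()
end
"""

def render_file(path, splash_url='http://localhost:8050/execute'):
    """Read a local HTML file and ask Splash to screenshot it.
    The Splash URL assumes a local Docker container."""
    with open(path, encoding='utf-8') as f:
        html = f.read()
    payload = json.dumps({'lua_source': render_local, 'html': html}).encode()
    req = urllib.request.Request(
        splash_url, data=payload,
        headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # raw PNG bytes
```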
7 votes, 1 answer

How can I use Scrapy-Splash without Docker?

Is there a way to use Scrapy-Splash without Docker? I mean, I have a server running Python 3 without Docker installed, and if possible I don't want to install Docker on it. Also, what exactly does SPLASH_URL do? Can I use only the IP of my server? I…
7 votes, 1 answer

scrapy-splash returns its own headers and not the original headers from the site

I use scrapy-splash to build my spider. Now what I need is to maintain the session, so I use the scrapy.downloadermiddlewares.cookies.CookiesMiddleware and it handles the set-cookie header. I know it handles the set-cookie header because I set…
Roman Smelyansky • 319
6 votes, 2 answers

Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it

My steps: Build image docker build . -t scrapy Run a container docker run -it -p 8050:8050 --rm scrapy In container run scrapy project: scrapy crawl foobar -o allobjects.json This works locally, but on my production server I get…
Adam • 6,041
6 votes, 1 answer

Scrapy-Splash ERROR 400: "description": "Required argument is missing: url"

I'm using scrapy splash in my code to generate javascript-html codes. And splash is giving me back this render.html { "error": 400, "type": "BadOption", "description": "Incorrect HTTP API arguments", "info": { "type":…
6 votes, 1 answer

How to send custom headers in a Scrapy Splash request?

My spider.py file is as so: def start_requests(self): for url in self.start_urls: yield scrapy.Request( url, self.parse, headers={'My-Custom-Header':'Custom-Header-Content'}, meta={ …
Nadun Perera • 565
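Headers set on the Scrapy request travel to Splash itself rather than to the target site. One approach is to pass them through the 'headers' argument that Splash's render endpoints accept, so Splash uses them when fetching the page. A sketch (the header name is taken from the excerpt):

```python
# Headers for Splash to send to the target site; the render endpoints of
# the Splash HTTP API accept these via the 'headers' argument.
custom_headers = {'My-Custom-Header': 'Custom-Header-Content'}

# From the spider, roughly:
#   yield SplashRequest(url, self.parse, endpoint='render.html',
#                       args={'headers': custom_headers})
```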
6 votes, 1 answer

Form Request Using Scrapy + Splash

I am trying to login to a website using the following code (slightly modified for this post): import scrapy from scrapy_splash import SplashRequest from scrapy.crawler import CrawlerProcess class Login_me(scrapy.Spider): name = 'espn' …
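One way to log in under Splash is a Lua script that fills and submits the form before returning the rendered page; the form selector and field names below are illustrative and must match the real form. scrapy-splash also ships a SplashFormRequest class as an alternative.

```python
# Lua script that logs in before returning the page. 'form#login',
# 'username' and 'password' are illustrative; credentials come in via
# args so they are not hard-coded into the script.
login_script = """
function main(splash, args)
  assert(splash:go(args.url))
  splash:wait(1)
  local form = splash:select('form#login')
  assert(form:fill({username = args.user, password = args.pass}))
  assert(form:submit())
  splash:wait(2)
  return splash:html()
end
"""

# From the spider, roughly:
#   yield SplashRequest(url, self.after_login, endpoint='execute',
#                       args={'lua_source': login_script,
#                             'user': 'me', 'pass': 'secret'})
```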