
I have an issue with Scrapy, Crawlera and Splash when trying to fetch responses from this site.

I tried the following without luck:

  • pure Scrapy shell - times out
  • Scrapy + Crawlera - times out
  • Scrapinghub Splash instance (small) - times out

However, I can scrape the site with Selenium's Firefox webdriver. But I want to move away from that and use Splash instead.

Is there a workaround to avoid these timeouts?

NOTE:

If I use local Splash instances set up with Aquarium, the site loads, though it still takes 20+ seconds compared to the Firefox webdriver's 10 seconds.

Szabolcs

2 Answers


Try increasing the timeouts for Splash. If you run Splash using Docker, set the --max-timeout parameter to a higher value, e.g. 3600 (for more info, see the documentation).
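For example, when starting Splash with Docker (3600 is only the upper bound; each request still passes its own `timeout` argument):

```shell
# Start Splash locally with a raised hard cap on per-request timeouts.
docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 3600
```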

Next, also increase the timeout in your Splash requests. If you use the scrapy-splash library, set the timeout argument of SplashRequest to a higher value, e.g. 3600. Like this:

yield scrapy_splash.SplashRequest(
    url, self.parse, endpoint='execute',
    args={'lua_source': script, 'timeout': 3600})
Tomáš Linhart
  • Yeah, I can do that, but it's just not the right approach, I think. When a site can be loaded in 3 seconds, why should I set a timeout of 3600 seconds? Is there another option to speed up Splash? – Szabolcs Jan 18 '18 at 13:43
  • Just give it a try and see if it helps and solves the problem. If it does, you can always think about better approaches. Also, look at the documentation I linked in the answer, there's a whole section dedicated to such issues. – Tomáš Linhart Jan 18 '18 at 13:48
  • Well, I can't seem to set the max timeout for Splash instances hosted on Scrapinghub. Also, I went through the docs like a thousand times, but to no avail. Other than that, I think it's somehow related to some anti-scrape solution, as I just don't get why it takes so long for Splash to render the page; maybe the site has some specific defense against it. I also used various headers like User-Agent and Referer to get different results, but without luck. – Szabolcs Jan 18 '18 at 13:51
  • I'm sure there's a way, just contact them on [support](https://support.scrapinghub.com/support/home). – Tomáš Linhart Jan 18 '18 at 13:54
  • When you try to render the page using the Splash web console, you can see the timings. There are also other tricks, e.g. setting [`resource_timeout`](http://splash.readthedocs.io/en/stable/api.html#render-html), to speed up the rendering process. – Tomáš Linhart Jan 18 '18 at 13:56
  • I do use the web console, so I can see the timings, and for other sites it's just okay, but sites like this one are problematic. Also `resource_timeout` is not the way to go; I already tried it, but according to my findings it's not for speeding up, it's for setting a default timeout value for each request. – Szabolcs Jan 18 '18 at 14:00
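The `resource_timeout` idea from the comments can be sketched as follows. This is a minimal sketch assuming the scrapy-splash library; the values are illustrative, not tuned for any particular site:

```python
# Sketch of Splash args combining an overall render budget with a
# per-resource cutoff. `timeout` caps the whole render, while
# `resource_timeout` aborts any single slow resource (ad, tracker)
# so it cannot block the page from finishing.
splash_args = {
    'timeout': 90,           # overall render budget, in seconds
    'resource_timeout': 10,  # drop any resource slower than 10 s
    'wait': 0.5,             # settle time after the page loads
}

# In a spider this would be used roughly as:
# yield scrapy_splash.SplashRequest(url, self.parse,
#                                   endpoint='render.html', args=splash_args)
```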

You could retry the request in scrapy shell, setting the user agent in the headers. For me, this method worked in a few seconds. Using the default user agent caused the connection to be dropped by the site. The default user agent declares that you're using Scrapy, so it makes sense that the site would choose to drop the connection.

Replace the custom user agent with your own browser's user agent (or any one you prefer), and replace the url with your own. You can try the following steps, and then view the response in your browser:

scrapy shell
url = "https://www.yoururl.com"
request = scrapy.Request(url, headers={'User-Agent': 'custom user agent'})
fetch(request)
view(response)
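Alternatively, to apply the user agent to every request without building each `Request` by hand, you can set it project-wide via Scrapy's `USER_AGENT` setting in `settings.py` (the UA string below is only an example; copy the one your own browser sends):

```python
# settings.py fragment: a browser-like User-Agent applied to all requests.
# Example string only -- replace it with your own browser's User-Agent.
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
)
```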
Matts
  • Thanks for your answer. I already tried to set custom User-Agent headers in Splash, without luck. But I'll recheck it to be sure! – Szabolcs Jan 20 '18 at 16:09