
I'm trying to scrape a site that renders part of its content with JavaScript.

I found this project: https://github.com/scrapinghub/sample-projects/tree/master/splash_smart_proxy_manager_example, which explains the setup quite neatly. Here's what I have right now:

Docker compose:

version: '3.8'

services:
    scraping:
        build:
            context: .
            dockerfile: Dockerfile
        volumes:
            - "./scraping:/scraping"
        environment:
            - PYTHONUNBUFFERED=1
        depends_on:
            - splash
        links:
            - splash
    splash:
        image: scrapinghub/splash
        restart: always
        expose:
            - 5023
            - 8050
            - 8051
        ports:
            - "5023:5023"
            - "8050:8050"
            - "8051:8051"

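Spider:

# get_data is assumed to be pkgutil.get_data, which matches the call below
from pkgutil import get_data

import scrapy
from scrapy_splash import SplashRequest


class HappySpider(scrapy.Spider):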
    ...
    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'ITEM_PIPELINES': {
            'scraping.pipelines.HappySpiderPipeline': 300,
        },
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 429, 403],
        'RETRY_TIMES': 20,
        'DOWNLOAD_DELAY': 5,
        'DOWNLOAD_TIMEOUT': 30,
        'CONCURRENT_REQUESTS': 1,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'COOKIES_ENABLED': False,
        'ROBOTSTXT_OBEY': True,
        # enable Zyte Proxy
        'ZYTE_SMARTPROXY_ENABLED': True,
        # the APIkey you get with your subscription
        'ZYTE_SMARTPROXY_APIKEY': '<my key>',
        'SPLASH_URL': 'http://splash:8050/',
    }

    def __init__(self, testing=False, name=None, **kwargs):
        self.LUA_SOURCE = get_data(
            'scraping', 'scripts/smart_proxy_manager.lua'
        ).decode('utf-8')
        super().__init__(name, **kwargs)

    def start_requests(self):

        yield SplashRequest(
            url='https://www.someawesomesi.te',
            endpoint='execute',
            args={
                'lua_source': self.LUA_SOURCE,
                'crawlera_user': self.settings['ZYTE_SMARTPROXY_APIKEY'],
                'timeout': 90,
            },
            # tell Splash to cache the lua script, to avoid sending it for every request
            cache_args=['lua_source'],
            meta={
                'max_retry_times': 10,
            },
            callback=self.my_callback
        )

And the output I get is:

2022-08-10 13:09:32 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.someawesomesi.te via http://splash:8050/execute> (failed 1 times): 504 Gateway Time-out

I'm not sure how to proceed here. I looked into why Splash would return a 504, and the Splash docs do suggest some ways of handling it, but I don't have many concurrent URLs and the script fails on the very first one. Also, the site I'm scraping is very fast: if I use Zyte without Splash, it scrapes quickly.

So if anybody can suggest what's wrong here and how to fix it, I'd greatly appreciate it.

zaki98
Odif Yltsaeb
  • I think as long as the scraped site is not under your control you can't do anything about a site-error. Can you call the site in the browser? – David Aug 14 '22 at 05:03
  • Did you read that the site generated a 504 error? I read it as the 504 was something that splash resulted in. Because the site scraped without splash and with Zyte scrapes just fine. I only need scrapy-splash for rendering the JS. Otherwise the same scraper with zyte works just fine. – Odif Yltsaeb Aug 14 '22 at 09:45
  • @OdifYltsaeb I am a Developer Advocate at Zyte, Sergey just answered your question. If the issue still persists, feel free to reach out to support :) https://support.zyte.com/support/tickets/new – Neha Setia Nagpal Aug 15 '22 at 10:01

2 Answers

Splash is getting deprecated soon. You can use headless browser libraries for rendering JS along with Smart Proxy Manager. Zyte recently launched three headless browser libraries.

  1. Zyte SmartProxy Puppeteer.
  2. Zyte SmartProxy Playwright.
  3. Zyte SmartProxy Selenium.

These client libraries are built on top of their native libraries for web automation across Chromium, Firefox, and WebKit, and are written to work seamlessly with Zyte Smart Proxy Manager. Using these libraries, you no longer have to maintain a separate piece of software (like Splash) running in the background to connect to Zyte Smart Proxy Manager.

My recommendation, though, would be to use Zyte API. Zyte API is an end-to-end API solution that executes all tasks in the web-scraping sequence. It can extract dynamically loaded web page content without you having to spend time recreating what the browser does through JavaScript, headless browser libraries and additional requests.

For this particular solution, follow this documentation. Just set the javascript parameter, which turns JavaScript on or off during browser rendering, and it just works...
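As a rough illustration of what that could look like from Scrapy, here is a minimal sketch assuming the scrapy-zyte-api plugin mentioned in the comments below. The setting names and the "zyte_api" meta key are taken from that plugin's README as best I recall it, so verify them against the version you install; `ZYTE_API_SETTINGS` and `zyte_api_meta` are hypothetical names used only for this example.

```python
# Sketch only: setting names follow the scrapy-zyte-api README as recalled;
# check them against the plugin version you actually install.
ZYTE_API_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
        "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    "ZYTE_API_KEY": "<my key>",  # placeholder, as in the question
}


def zyte_api_meta(javascript: bool = True) -> dict:
    """Build request meta asking Zyte API for browser-rendered HTML.

    Hypothetical helper: "browserHtml" requests the rendered page and
    "javascript" turns JavaScript on or off during rendering.
    """
    return {"zyte_api": {"browserHtml": True, "javascript": javascript}}


# Intended use inside a spider (not executed here):
#   custom_settings = ZYTE_API_SETTINGS
#   yield scrapy.Request(url, meta=zyte_api_meta(), callback=self.my_callback)
```

With this approach there is no Splash container or Lua script to maintain; the rendered HTML should arrive in the normal Scrapy response.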

I work as a Developer Advocate @Zyte.

  • I think I need to change the service then, right? From "Smart Proxy Manager" to "Smart Browser"? – Odif Yltsaeb Aug 16 '22 at 11:44
  • Or would that be "Automatic Extraction" service? Please point me towards the correct one I need to enable. – Odif Yltsaeb Aug 16 '22 at 13:04
  • It would be best to migrate from Smart Proxy Manager to Zyte API. You can use Zyte API with scrapy using this library. https://github.com/scrapy-plugins/scrapy-zyte-api – Neha Setia Nagpal Aug 16 '22 at 16:31
  • I have added the link to documentation... if you need help in migration let me know... Happy to help :) – Neha Setia Nagpal Aug 16 '22 at 16:34
  • Hey @Neha - I noticed the GitHub link earlier too. I'm just having trouble understanding which product from here https://www.zyte.com/pricing/ is "Zyte API" or "Zyte Data API"; none of the services on the pricing page seems like a 100% match. – Odif Yltsaeb Aug 17 '22 at 06:01
  • Hey @OdifYltsaeb, you need to choose [Smart Browser](https://www.zyte.com/smart-browser-api-anti-fingerprinting/). Zyte Data API will be rechristened to Zyte API next month. Sorry for the confusion. I am sure Smart Browser will resolve this issue. Happy Scraping :) – Neha Setia Nagpal Aug 17 '22 at 07:17
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/247325/discussion-between-neha-setia-nagpal-and-odif-yltsaeb). – Neha Setia Nagpal Aug 17 '22 at 07:18

This example did not work out of the box for me either. Changing Zyte Smart Proxy Manager's port number specified in splash_smart_proxy_manager_example/scripts/smart_proxy_manager.lua to 8010 helped.

local port = 8010

8010 was used in the older example
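If you would rather not edit the Lua file itself, the same change can be applied when the spider loads the script. This is a minimal sketch, assuming the script contains a single line of the form `local port = <number>` as shown above; `set_proxy_port` is a hypothetical helper, not part of the sample project.

```python
import re


def set_proxy_port(lua_source: str, port: int) -> str:
    """Return lua_source with its `local port = <number>` line set to `port`.

    Hypothetical helper: raises ValueError if no such line exists, so a
    change in the script's layout does not go unnoticed.
    """
    patched, count = re.subn(
        r"(local\s+port\s*=\s*)\d+",
        lambda match: f"{match.group(1)}{port}",
        lua_source,
        count=1,
    )
    if count == 0:
        raise ValueError("no `local port = <number>` line found in the Lua script")
    return patched


# Intended use in the spider's __init__, after loading the script:
#   self.LUA_SOURCE = set_proxy_port(self.LUA_SOURCE, 8010)
```

Editing the file directly is simpler; patching at load time only helps if you want to keep the upstream script untouched.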

Sergey Geron
  • This indeed got the scraping working, so it seems this is the right answer to accept as the solution, but Splash did not return results with the JS-rendered data... which is the fault of Splash, not this answer. But I'd rather accept the answer that helps someone along the most, so I'll go ahead and check the tools that @Neha mentioned in her answer. Thank you Sergey! – Odif Yltsaeb Aug 16 '22 at 11:29