
I need to scrape many URLs with Selenium and Scrapy. To speed up the whole process, I'm trying to create a pool of shared Selenium instances. My idea is to have a set of parallel Selenium instances available to any Request that needs one and released when it's done.

I tried to create a middleware, but the problem is that the middleware processes requests sequentially (I see all the drivers (I call them browsers) loading URLs one after another). I want all drivers to work in parallel.

import time

from scrapy.http import HtmlResponse
from selenium import webdriver

# BASE_DIR is defined elsewhere in the project and points to the directory
# that contains the chromedriver binary.


class ScrapySpiderDownloaderMiddleware(object):
    BROWSERS_COUNT = 10

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Pool of idle drivers, all created up front.
        self.free_browsers = set(
            webdriver.Chrome(executable_path=BASE_DIR + '/chromedriver')
            for x in range(self.BROWSERS_COUNT)
        )

    def get_free_browser(self):
        # Busy-wait until some driver is returned to the pool.
        while True:
            try:
                return self.free_browsers.pop()
            except KeyError:
                time.sleep(0.1)

    def release_browser(self, browser):
        self.free_browsers.add(browser)

    def process_request(self, request, spider):
        browser = self.get_free_browser()

        browser.get(request.url)

        body = str.encode(browser.page_source)
        self.release_browser(browser)

        # Expose the driver via the "meta" attribute
        request.meta.update({'browser': browser})

        return HtmlResponse(
            browser.current_url,
            body=body,
            encoding='utf-8',
            request=request
        )

I don't like solutions where you do:

driver.get(response.url) 

in the parse method, because it causes redundant requests: every URL is requested twice, which I need to avoid.

For example, this answer: https://stackoverflow.com/a/17979285/2607447
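Roughly, that pattern looks like this (a minimal sketch of the approach I want to avoid; the spider name and URL are placeholders, not taken from the linked answer):

import scrapy
from selenium import webdriver


class SeleniumInParseSpider(scrapy.Spider):
    name = 'selenium_in_parse'  # hypothetical name, for illustration only
    start_urls = ['https://example.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    def parse(self, response):
        # Scrapy has already downloaded response.url once to build this response;
        # the next line downloads the same URL a second time through Selenium.
        self.driver.get(response.url)
        rendered = scrapy.Selector(text=self.driver.page_source)
        yield {'title': rendered.css('title::text').extract_first()}

    def closed(self, reason):
        self.driver.quit()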

Do you know what to do?

Milano
  • Have you looked into Splash? It's a microservice that performs the function you are looking for - it has N workers that act as middlemen and render your requests in WebKit. https://github.com/scrapinghub/splash – Granitosaurus Jan 23 '19 at 02:55
  • @Granitosaurus Thank you, I'm trying to make it work and it does work, but I have one little problem: I'm afraid the splash:go call in the Lua script performs a new request. I've posted a new question: https://stackoverflow.com/questions/54327258/scrapy-splash-does-splashgourl-in-lua-script-perform-get-request-again – Milano Jan 23 '19 at 12:33

2 Answers


I suggest you look towards Scrapy + Docker. You can run many instances at once.
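One rough sketch of how that could work, assuming you split the URL list between containers via environment variables (the SHARD/SHARDS variables and urls.txt file are my own placeholders, not part of Scrapy or Docker):

import os

import scrapy


class ShardedSpider(scrapy.Spider):
    name = 'sharded_example'  # hypothetical name, for illustration only

    def start_requests(self):
        # Each container is started with its own SHARD index (0 .. SHARDS - 1)
        # and only crawls its slice of the URL list, so instances don't overlap.
        shard = int(os.environ.get('SHARD', '0'))
        shards = int(os.environ.get('SHARDS', '1'))
        with open('urls.txt') as f:
            urls = [line.strip() for line in f if line.strip()]
        for i, url in enumerate(urls):
            if i % shards == shard:
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').extract_first()}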


As @Granitosaurus suggested, Splash is a good choice. I personally used scrapy-splash: Scrapy takes care of the parallel processing and Splash takes care of rendering the website, including JavaScript execution.
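A minimal setup looks roughly like this (a sketch; the spider name and URL are placeholders, and the settings are the ones I remember from the scrapy-splash README, with Splash running locally on port 8050):

# settings.py
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# spider
import scrapy
from scrapy_splash import SplashRequest


class SplashExampleSpider(scrapy.Spider):
    name = 'splash_example'  # hypothetical name, for illustration only

    def start_requests(self):
        # Each request is rendered by one of Splash's parallel workers.
        yield SplashRequest('https://example.com', self.parse, args={'wait': 0.5})

    def parse(self, response):
        # response.body is the rendered HTML, downloaded only once (by Splash).
        yield {'title': response.css('title::text').extract_first()}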

Milano