I need to scrape many URLs with Selenium and Scrapy. To speed up the whole process, I'm trying to create a pool of shared Selenium instances. My idea is to have a set of parallel Selenium instances available to any Request: a request grabs a free instance when it needs one and releases it when it's done.

I tried to create a downloader Middleware, but the problem is that the Middleware processes requests sequentially (I see all the drivers (I call them browsers) loading URLs one after another). I want all the drivers to work in parallel. Here is what I have so far:
import time

from scrapy.http import HtmlResponse
from selenium import webdriver


class ScrapySpiderDownloaderMiddleware(object):
    BROWSERS_COUNT = 10

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Start a fixed pool of Chrome instances shared by all requests.
        self.free_browsers = set(
            webdriver.Chrome(executable_path=BASE_DIR + '/chromedriver')
            for x in range(self.BROWSERS_COUNT)
        )

    def get_free_browser(self):
        # Busy-wait until some request returns a browser to the pool.
        while True:
            try:
                return self.free_browsers.pop()
            except KeyError:
                time.sleep(0.1)

    def release_browser(self, browser):
        self.free_browsers.add(browser)

    def process_request(self, request, spider):
        browser = self.get_free_browser()
        browser.get(request.url)
        body = str.encode(browser.page_source)
        self.release_browser(browser)
        # Expose the driver via the "meta" attribute
        request.meta.update({'browser': browser})
        return HtmlResponse(
            browser.current_url,
            body=body,
            encoding='utf-8',
            request=request,
        )
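As far as I understand, the blocking calls here (browser.get and the time.sleep loop) run in Scrapy's reactor thread, so only one browser is ever working at a time. One direction I've been considering is pushing the blocking work into a thread and returning a Deferred from process_request, roughly like this (untested sketch; I'm assuming process_request is allowed to return a Deferred, and _download_with_browser is just a helper name I made up):

from twisted.internet import threads


class ScrapySpiderDownloaderMiddleware(object):
    # ... __init__, get_free_browser and release_browser unchanged from above ...

    def process_request(self, request, spider):
        # Returning a Deferred frees the reactor to hand other requests to
        # other browsers while this page is still loading.
        return threads.deferToThread(self._download_with_browser, request)

    def _download_with_browser(self, request):
        # Runs in a worker thread, so blocking here does not stall Scrapy.
        browser = self.get_free_browser()
        try:
            browser.get(request.url)
            body = str.encode(browser.page_source)
            request.meta.update({'browser': browser})
            return HtmlResponse(
                browser.current_url,
                body=body,
                encoding='utf-8',
                request=request,
            )
        finally:
            self.release_browser(browser)

I'm not sure whether this is the idiomatic way to do it, or whether Twisted's thread pool size (REACTOR_THREADPOOL_MAXSIZE) would need to be raised to match BROWSERS_COUNT.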
I also don't like solutions where you call driver.get(response.url) in the parse method, because they cause redundant requests: every URL is requested twice, which I need to avoid. For example, this answer does exactly that: https://stackoverflow.com/a/17979285/2607447
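Concretely, the pattern I'm referring to looks roughly like this (the spider name and URL below are just placeholders):

import scrapy
from selenium import webdriver


class DoubleRequestSpider(scrapy.Spider):
    # Placeholder spider illustrating the pattern I want to avoid.
    name = 'double_request_example'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # The page was already downloaded once by Scrapy to produce `response`;
        # this fetches the exact same URL a second time through Selenium.
        driver = webdriver.Chrome()
        driver.get(response.url)
        # ... extract data from driver.page_source ...
        driver.quit()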
Do you know what to do?