Quick description

I process many pages with Selenium sequentially. To improve performance, I decided to parallelize the work by splitting the pages across several threads (this is possible because the pages are independent of one another).

Here is the simplified code:

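from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

def process_page(driver, page, lock):
    driver.get(page.url())
    driver.find_element_by_css_selector("some selector")
    # `wait` is not defined in this simplified snippet; it is assumed to be a
    # WebDriverWait created for this thread's driver.
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "some selector")))
    # ... (rest of the processing elided; it builds result_tuple)
    # Log under the lock so entries from different threads do not interleave.
    with lock:
        for item in result_tuple:
            logger.info(item)
    return result_tuple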

import threading
from concurrent.futures import ThreadPoolExecutor

def process_all_pages():
    def pages_processing(worker_id, lock):
        # Each worker owns one driver and handles its own slice of 50 pages.
        result = []
        with MyWebDriver(webdriver_options) as driver:
            for i in range(50):
                result.append(process_page(driver, pages[worker_id * 50 + i], lock))
        return result

    lock = threading.Lock()

    with ThreadPoolExecutor(4) as executor:
        futures = [executor.submit(pages_processing, i, lock) for i in range(4)]
        # result() blocks until the worker finishes and re-raises its exception, if any.
        result = [future.result() for future in futures]

    return result

MyWebDriver is a simple context manager for the Chrome driver: entering the context spawns a new Chrome driver instance, and exiting the context quits that instance.
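For reference, the behavior of MyWebDriver is roughly what the sketch below does. The names managed_driver and FakeDriver are hypothetical and exist only so the example runs without a browser; in the real code the factory would create a Chrome driver instead (for example something like `lambda: webdriver.Chrome(options=webdriver_options)`, which is an assumption about how the driver is built).

```python
from contextlib import contextmanager


class FakeDriver:
    """Hypothetical stand-in for a Chrome driver; it only tracks quit()."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


@contextmanager
def managed_driver(factory):
    """Create a driver on entry and always quit it on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs even if page processing raises, so no browser is leaked.
        driver.quit()


with managed_driver(FakeDriver) as driver:
    # The driver is live inside the block.
    assert not driver.closed
# After the block the driver has been quit.
assert driver.closed
```

The try/finally matters in the threaded setup: if one worker fails halfway through its pages, its browser is still shut down.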

This code spawns four Chrome drivers, one per thread, and each thread does its Selenium work only in its own driver.
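The one-driver-per-thread split can be illustrated without Selenium. This is only a sketch of the partitioning pattern, not my actual code: chunk, process_chunk and process_all are hypothetical names, and process_chunk just upper-cases strings where the real worker would open a driver and process its pages.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split items into consecutive slices of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(pages):
    # Placeholder for: open one driver, then process every page in `pages`.
    return [page.upper() for page in pages]


def process_all(pages, workers=4, per_worker=50):
    chunks = chunk(pages, per_worker)
    with ThreadPoolExecutor(workers) as executor:
        # map() yields results in submission order, one list per chunk.
        return list(executor.map(process_chunk, chunks))


results = process_all([f"page{i}" for i in range(6)], workers=2, per_worker=3)
print(results)
```

Unlike my code above, which submits exactly four tasks of 50 pages each, this version produces as many chunks as the input needs, so a thread may handle more than one chunk when there are more chunks than workers.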

The problem

For the first few seconds it works like a charm, but after some time warnings start to appear in the log and Selenium seems to stop communicating with the Chrome drivers.

  • The same behavior appears with any number of threads greater than one; it works only when running on a single thread.
  • The same behavior occurs on both Windows and Ubuntu.

If needed, I can also provide debug logs, but I'm not sure whether they contain anything relevant.
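Debug logs could be captured with something like the sketch below, which uses only the standard logging module. The function name enable_debug_logging is hypothetical. The "urllib3" logger name is taken from the warnings I am seeing; "selenium" is an assumption about the package's logger name. Including the thread name in the format should make it easier to see which worker produced each line.

```python
import logging


def enable_debug_logging():
    """Turn on DEBUG output for the HTTP layer and tag lines with the thread name."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s - %(name)s - %(threadName)s - %(message)s",
    )
    # Child loggers such as urllib3.connectionpool inherit the parent's level.
    for name in ("urllib3", "selenium"):
        logging.getLogger(name).setLevel(logging.DEBUG)


enable_debug_logging()
```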

The warnings in the logger:

...
# With these first warnings Selenium stops communicating with some of the Chrome drivers - nothing happens in them anymore.
WARNING - urllib3.connectionpool - Connection pool is full, discarding connection: 127.0.0.1
WARNING - urllib3.connectionpool - Connection pool is full, discarding connection: 127.0.0.1
...
# These warnings come a bit later
WARNING - urllib3.connectionpool - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000018343AB24A8>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/9c9fc148f278aaa360a26d95eac0966e/url
WARNING - urllib3.connectionpool - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000018348854E10>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/9c9fc148f278aaa360a26d95eac0966e/url
WARNING - urllib3.connectionpool - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000018348869710>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/9c9fc148f278aaa360a26d95eac0966e/url
...

Tested workarounds

I tried these patches to set a higher maxsize (HTTPConnectionPool, HTTPSConnectionPool) - https://stackoverflow.com/a/22253656/10580513 - but this didn't fix the problem, even though the patches were executed.

Next I tried setting a higher num_pools in the PoolManager class, together with a higher maxsize in HTTPConnectionPool and HTTPSConnectionPool; I changed these directly in the library sources. This solved one issue: the warnings disappeared from the log. However, Selenium's communication with the drivers still froze.

xbalaj
  • This won't work. Use Puppeteer/Pyppeteer if you must have concurrency. – pguardiario Jan 28 '20 at 09:18
  • @pguardiario I think it is possible - in my example it is running separately in the threads. Some picks from conversations confirming the idea: https://stackoverflow.com/questions/30808606/can-selenium-use-multi-threading-in-one-browser https://groups.google.com/forum/#!msg/webdriver/cw_awztl-IM/pzxEwOUWnbMJ – xbalaj Jan 28 '20 at 09:49
  • Confirming the idea that Selenium is non-blocking/thread-safe? Sorry, I think you misunderstood what you read. – pguardiario Jan 28 '20 at 12:54
  • @pguardiario nope, confirming the idea that multiple instances can be run simultaneously on different threads. – xbalaj Jan 28 '20 at 16:50

0 Answers