
This is the error traceback after several hours of scraping:

The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.

This is my setup of selenium python:

#scrape.py
from selenium import webdriver
from selenium.common.exceptions import *
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options

def run_scrape(link):
    chrome_options = Options()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument("--headless")
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument("--lang=en")
    chrome_options.add_argument("--start-maximized")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")
    chrome_options.binary_location = "/usr/bin/google-chrome"
    browser = webdriver.Chrome(executable_path=r'/usr/local/bin/chromedriver', options=chrome_options)
    browser.get(link)
    try:
        pass  # scrape process
    except Exception:
        pass  # other stuffs
    browser.quit()
#multiprocess.py
import time
from multiprocessing import Pool
from scrape import *

if __name__ == '__main__':
    start_time = time.time()
    #links = list of links to be scraped
    pool = Pool(20)
    results = pool.map(run_scrape, links)
    pool.close()
    print("Total Time Processed: "+"--- %s seconds ---" % (time.time() - start_time))

Chrome, ChromeDriver Setup, Selenium Version

ChromeDriver 79.0.3945.36 (3582db32b33893869b8c1339e8f4d9ed1816f143-refs/branch-heads/3945@{#614})
Google Chrome 79.0.3945.79
Selenium Version: 4.0.0a3

I'm wondering why Chrome is crashing while the other processes keep working?


2 Answers


I took your code, modified it a bit to suit my test environment, and here are the execution results:

  • Code Block:

    • multiprocess.py:

      import time
      from multiprocessing import Pool
      from multiprocessingPool.scrape import run_scrape
      
      if __name__ == '__main__':
          start_time = time.time()
          links = ["https://selenium.dev/downloads/", "https://selenium.dev/documentation/en/"] 
          pool = Pool(2)
          results = pool.map(run_scrape, links)
          pool.close()
          print("Total Time Processed: "+"--- %s seconds ---" % (time.time() - start_time)) 
      
    • scrape.py:

      from selenium import webdriver
      from selenium.common.exceptions import NoSuchElementException, TimeoutException
      from selenium.webdriver.common.by import By
      from selenium.webdriver.chrome.options import Options
      
      def run_scrape(link):
          chrome_options = Options()
          chrome_options.add_argument('--no-sandbox')
          chrome_options.add_argument("--headless")
          chrome_options.add_argument('--disable-dev-shm-usage')
          chrome_options.add_argument("--lang=en")
          chrome_options.add_argument("--start-maximized")
          chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
          chrome_options.add_experimental_option('useAutomationExtension', False)
          chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")
          chrome_options.binary_location=r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
          browser = webdriver.Chrome(executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe', options=chrome_options)
          browser.get(link)
          try:
              print(browser.title)
          except (NoSuchElementException, TimeoutException):
              print("Error")
          browser.quit()
      
  • Console Output:

    Downloads
    The Selenium Browser Automation Project :: Documentation for Selenium
    Total Time Processed: --- 10.248600006103516 seconds ---
    

Conclusion

It is pretty evident that your program is logically sound.


This use case

As you mentioned, this error surfaces after several hours of scraping, and I suspect it is because WebDriver is not thread-safe. Having said that, if you can serialize access to the underlying driver instance, you can share a reference across more than one thread. This is not advisable, but you can always instantiate one WebDriver instance for each thread.

Ideally, the issue of thread safety isn't in your code but in the actual browser bindings. They all assume there will only be one command at a time (e.g., as with a real user). On the other hand, you can always instantiate one WebDriver instance for each thread, which will launch multiple browser windows. Up to this point, your program seems fine.

Now, different threads can be run against the same WebDriver, but then the results of the tests would not be what you expect. The reason is that when you use multithreading to drive different tabs/windows, a bit of thread-safety coding is required, or else actions such as click() or send_keys() will go to whichever opened tab/window currently has focus, regardless of the thread you expect to be running. That essentially means all the tests run simultaneously on the same tab/window that has focus, not on the intended tab/window.
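To make "serializing access" concrete, here is a minimal sketch. A stand-in FakeDriver class takes the place of a real WebDriver (the class and the URLs are assumptions for demonstration only; the point is the locking pattern, not Selenium itself): every thread acquires a lock before sending a command to the single shared instance.

```python
import threading

class FakeDriver:
    """Stand-in for a WebDriver instance; a real driver is NOT thread-safe."""
    def __init__(self):
        self.commands = []

    def get(self, url):
        self.commands.append(url)

shared_driver = FakeDriver()
driver_lock = threading.Lock()

def visit(url):
    # Serialize every command into the single shared driver instance.
    with driver_lock:
        shared_driver.get(url)

threads = [threading.Thread(target=visit, args=("https://example.com/%d" % i,))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared_driver.commands))  # 5 - all commands arrived, one at a time
```

With a real driver the same pattern would wrap each browser.get()/click()/send_keys() call in the lock, which of course removes most of the parallelism; that is why one driver per thread (or per process, as in your Pool setup) is usually preferred.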

undetected Selenium
  • Would the processing of my scraping be affected if I share only one WebDriver instance, with different workers focusing on each window? – Benjie Perez Feb 12 '20 at 08:53
  • @BenjiePerez The bottom line is that WebDriver is not thread-safe. Now it all depends on which measures you take, and how you implement them, to serialize access to the underlying WebDriver instance. – undetected Selenium Feb 12 '20 at 09:07
  • I see, thanks for the information about the behavior of WebDriver. – Benjie Perez Feb 12 '20 at 09:14

Right now I'm using the threading module to instantiate one WebDriver per thread:

import threading
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

threadLocal = threading.local()

def get_driver():
    browser = getattr(threadLocal, 'browser', None)
    if browser is None:
        chrome_options = Options()
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument("--headless")
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument("--lang=en")
        chrome_options.add_argument("--start-maximized")
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
        chrome_options.add_experimental_option('useAutomationExtension', False)
        chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")
        chrome_options.binary_location = "/usr/bin/google-chrome"
        browser = webdriver.Chrome(executable_path=r'/usr/local/bin/chromedriver', options=chrome_options)
        setattr(threadLocal, 'browser', browser)
    return browser

and it really helps me scrape faster than running one driver at a time.

  • My question is: how would I quit these instantiated WebDrivers after pool.close()? They remain sleeping on my server after the workers finish. – Benjie Perez Feb 13 '20 at 05:17
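One pattern that addresses the cleanup concern raised in the comment: record every driver in a shared registry at creation time, then quit them all once the pool has finished. This is a sketch under assumptions of my own; FakeDriver stands in for webdriver.Chrome, and the all_drivers registry is not part of the original code.

```python
import threading
from multiprocessing.pool import ThreadPool

thread_local = threading.local()
all_drivers = []                      # registry of every driver ever created
all_drivers_lock = threading.Lock()

class FakeDriver:
    """Stand-in for webdriver.Chrome so the sketch is self-contained."""
    def __init__(self):
        self.alive = True

    def quit(self):
        self.alive = False

def get_driver():
    browser = getattr(thread_local, 'browser', None)
    if browser is None:
        browser = FakeDriver()
        thread_local.browser = browser
        with all_drivers_lock:        # registry is shared across threads
            all_drivers.append(browser)
    return browser

def run_scrape(link):
    get_driver()                      # actual scraping elided
    return link

pool = ThreadPool(4)
links = ["https://example.com/%d" % i for i in range(8)]
results = pool.map(run_scrape, links)
pool.close()
pool.join()

# Quit every driver that was created, no matter which thread owned it.
for driver in all_drivers:
    driver.quit()

print(all(not d.alive for d in all_drivers))  # True - nothing left running
```

With a real Selenium driver, quit() just sends a shutdown command to the ChromeDriver process, so calling it from the main thread after the workers are done should release the lingering Chrome processes.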