I'm running a Python Selenium script on Ubuntu 18.04 using Amazon EC2.
I have a list of URLs and am using Selenium to loop through them and get info. Here's a very simple example of my script:
import requests
import selenium
from selenium import webdriver
from datetime import datetime as dt
import re
import time
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import ElementNotVisibleException
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
# set driver options
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--window-size=1420,1080')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("--disable-notifications")
chrome_options.add_argument("--remote-debugging-port=9222")
# excludeSwitches is set once: a second call with the same key replaces the first list
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation", "disable-popup-blocking"])
chrome_options.add_experimental_option('useAutomationExtension', False)
chrome_options.binary_location='/usr/bin/google-chrome-stable'
chrome_driver_binary = "/usr/bin/chromedriver"
events = ['https://www.bandsintown.com/e/1024970351-alukah-at-high-noon-saloon?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event', 'https://www.bandsintown.com/e/103265416-bill-roberts-combo-at-come-back-in?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event', 'https://www.bandsintown.com/e/1022530728-chiiild-at-the-sylvee?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event', 'https://www.bandsintown.com/e/103243450-aimee-mann-at-stoughton-opera-house?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event', 'https://www.bandsintown.com/e/1022530781-leon-bridges-at-the-sylvee?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event', 'https://www.bandsintown.com/e/103338969-jonathan-coulton-at-stoughton-opera-house?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event', 'https://www.bandsintown.com/e/1024312079-necronomicon-at-high-noon-saloon?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event', 'https://www.bandsintown.com/e/103395650-jon.-at-jitters-coffeehouse?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event', 'https://www.bandsintown.com/e/1024311602-the-convalescence-at-high-noon-saloon?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event', 'https://www.bandsintown.com/e/1024691276-todd-sheaffer-at-the-bur-oak?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event', 'https://www.bandsintown.com/e/103127087-alec-benjamin-at-the-sylvee?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event', 'https://www.bandsintown.com/e/1024272496-sara-kays-at-the-sylvee?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event']
for event in events:
    print("getting event")
    driver = webdriver.Chrome(executable_path=chrome_driver_binary, chrome_options=chrome_options)
    driver.get(event) #crashes!
    time.sleep(3)
    print(driver.title)
    driver.quit()
I get:
getting event
Alukah Madison Tickets, High Noon Saloon May 02, 2022 | Bandsintown
getting event
Bill Roberts Combo Madison Tickets, Come Back In May 02, 2022 | Bandsintown
getting event
Chiiild Madison Tickets, The Sylvee May 02, 2022 | Bandsintown
getting event
Aimee Mann Stoughton Tickets, Stoughton Opera House May 02, 2022 | Bandsintown
getting event
Leon Bridges Madison Tickets, The Sylvee May 02, 2022 | Bandsintown
getting event
Traceback (most recent call last):
  File "test.py", line 41, in <module>
    driver.get(event) #crashes!
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 333, in get
    self.execute(Command.GET, {'url': url})
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
  (Session info: headless chrome=101.0.4951.41)
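The message says the session itself was deleted, so the same driver object cannot be reused once the tab has crashed. As a stopgap while the root cause is unknown, each URL could be retried with a fresh browser. This is only a minimal sketch and not a fix: the helper name title_with_retry, the make_driver factory and the retry_on parameter are all hypothetical names introduced for illustration.

```python
import time


def title_with_retry(url, make_driver, attempts=3, pause=2.0, retry_on=(Exception,)):
    """Return the page title for url, starting a fresh browser on every attempt.

    make_driver is a hypothetical zero-argument factory returning a new
    WebDriver-like object (anything with get(), title and quit()).
    retry_on is the tuple of exception types that should trigger a retry.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        driver = make_driver()
        try:
            driver.get(url)
            return driver.title
        except retry_on as exc:
            last_error = exc
            print(f"attempt {attempt} failed for {url}: {exc}")
        finally:
            # Always release the browser; quit() may itself fail after a crash.
            try:
                driver.quit()
            except Exception:
                pass
        if attempt < attempts:
            time.sleep(pause)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

In the script above, make_driver could be something like lambda: webdriver.Chrome(executable_path=chrome_driver_binary, chrome_options=chrome_options), with retry_on=(WebDriverException,) imported from selenium.common.exceptions. This would keep the loop running past a crashed tab, but it would not explain why the tab crashes.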
What I've tried:
Increased the size of my Amazon EC2 instance. It now has 16 GB of memory, so I don't think that's the issue.
Tried countless tweaks to the chrome_options, including the --no-sandbox and --disable-dev-shm-usage options.
Tried stopping/restarting the instance.
Tried running the pkill chrome and pkill -f "(chrome)?(--headless)" commands from the command line to make sure all Chrome processes are killed. No luck.
Ensured Chromedriver and google-chrome are the same version. They are (version 101.0.4951.41).
Ensured the locations of the Chrome/Chromedriver binaries are correct. google-chrome and chromedriver are both in /usr/bin.
Tested it locally. It works locally and only fails on the Ubuntu EC2 instance.
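To check the assumption that memory is not the issue, the available memory and the free space in the shared-memory locations could be printed right before each driver.get call. The following is a minimal sketch using only the standard library. The function names are hypothetical, and the choice of /tmp rests on my understanding that --disable-dev-shm-usage makes Chrome put its shared-memory files there instead of in /dev/shm.

```python
import os
import shutil


def mem_available_mib(meminfo_path="/proc/meminfo"):
    """Return MemAvailable from a meminfo file in MiB, or None if unavailable."""
    try:
        with open(meminfo_path) as fh:
            for line in fh:
                if line.startswith("MemAvailable:"):
                    # The kernel reports this value in kB.
                    return int(line.split()[1]) // 1024
    except OSError:
        return None
    return None


def free_space_mib(path="/dev/shm"):
    """Return the free space under path in MiB, or None if path does not exist."""
    if not os.path.exists(path):
        return None
    return shutil.disk_usage(path).free // (1024 * 1024)


if __name__ == "__main__":
    print("MemAvailable (MiB):", mem_available_mib())
    print("/dev/shm free (MiB):", free_space_mib("/dev/shm"))
    print("/tmp free (MiB):", free_space_mib("/tmp"))
```

If these numbers drop sharply on the Bandsintown pages but not on the other sites, that would point toward a resource limit rather than a configuration problem. If they stay flat, memory is probably not the cause.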
Update
It seems like the issue is with this particular list of urls. For example, if I change the url list to this:
events = ["https://www.facebook.com",
          "https://www.reddit.com",
          "https://www.linkedin.com",
          "https://yahoo.com",
          "https://google.com",
          "https://quora.com",
          "https://sweetwater.com",
          "https://amazon.com",
          "https://youtube.com",
          "https://github.com",
          "https://stackoverflow.com",
          "https://worpress.com",
          "https://medium.com"]
The script seems to run more reliably and I don't get a page crash.
So, what is it about the URLs in my original list that is causing the crash? Could it be that those pages load so much content that they overload the headless browser? If so, is there a way to load those URLs more simply so they don't get bogged down with the map, popups, etc.?
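One way to test the "too much content" theory would be to ask Chrome to skip images and to stop waiting for every subresource. The sketch below is an assumption-laden experiment, not a known fix. The helper name apply_lightweight_settings is hypothetical, and it assumes the installed Selenium exposes set_capability on the Options object. Headless Chrome may not honor every preference.

```python
def apply_lightweight_settings(options):
    """Ask Chrome to skip images and not wait for every subresource.

    options is expected to behave like Selenium's Chrome Options object:
    it needs add_argument(), add_experimental_option() and set_capability().
    """
    # Chrome switch that turns off image loading in the renderer.
    options.add_argument("--blink-settings=imagesEnabled=false")
    # Chrome preference: 2 means "block" for the images content setting.
    # Note: setting "prefs" again later replaces this dictionary, so merge
    # the keys if the script sets other prefs as well.
    options.add_experimental_option(
        "prefs", {"profile.managed_default_content_settings.images": 2}
    )
    # WebDriver page load strategy: with "eager", get() returns once the DOM
    # is ready instead of waiting for images, stylesheets and subframes.
    options.set_capability("pageLoadStrategy", "eager")
    return options
```

It would be called as apply_lightweight_settings(chrome_options) before the driver is created. If the crash still happens with images blocked and an eager load strategy, page weight alone is unlikely to be the explanation.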
Update 2
I've tried two more solutions:
Instead of:
driver = webdriver.Chrome(executable_path=chrome_driver_binary, chrome_options=chrome_options)
I used:
driver = webdriver.Chrome(executable_path=chrome_driver_binary, options=chrome_options)
I also added these prefs:
chrome_options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.media_stream_mic": 2,
    "profile.default_content_setting_values.media_stream_camera": 2,
    "profile.default_content_setting_values.geolocation": 2,
    "profile.default_content_setting_values.notifications": 2
})
And it still crashes:
Alukah Madison Tickets, High Noon Saloon May 02, 2022 | Bandsintown
getting event
Bill Roberts Combo Madison Tickets, Come Back In May 02, 2022 | Bandsintown
getting event
Chiiild Madison Tickets, The Sylvee May 02, 2022 | Bandsintown
getting event
Traceback (most recent call last):
  File "test.py", line 160, in <module>
    driver.get(event) #crashes!
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 333, in get
    self.execute(Command.GET, {'url': url})
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash
from tab crashed
  (Session info: headless chrome=101.0.4951.41)