
I am trying to scrape a web page with Selenium and Chrome from an Azure Databricks notebook. This is the code I am using:

%pip install selenium
%pip install webdriver_manager

from selenium import webdriver 
from selenium.webdriver import Chrome 
from selenium.webdriver.chrome.service import Service 
from selenium.webdriver.common.by import By 
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as ExpectedConditions
from selenium.webdriver.chrome.options import Options

# Specify the path to the uploaded chromedriver file
chrome_driver_path = '/dbfs/FileStore/Chromedriver/chromedriver'
chrome_service = Service(chrome_driver_path)

# Configure Chrome options
options = Options()
options.binary_location = "C:\Program Files\Google\Chrome\Application"

options.add_argument('--headless')  # Run Chrome in headless mode (without GUI)
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")

# Create a new Chrome webdriver instance
driver = webdriver.Chrome(service=chrome_service, options=options)

# Example usage: Open a website and print the page title
url = "https://data.cms.gov/tools/mapping-medicare-disparities-by-population"
driver.get(url)


# Clean up and quit the webdriver
driver.quit()

However, I get the following error:

WebDriverException: Message: unknown error: no chrome binary at C:\Program Files\Google\Chrome\Application Stacktrace:

Abhishek Jain
  • Databricks is a cloud application; you cannot refer to a local folder on your computer (C:\Program Files\). – Chen Hirsh May 29 '23 at 14:19
  • @ChenHirsh Thanks. Then how can I specify the binary location of Chrome? Should the binary be added to DBFS, i.e. Delta Lake? – Abhishek Jain May 30 '23 at 05:40
  • @Abhishek, that is probably the way to go. I found this answer with detailed steps; I hope it helps you: https://stackoverflow.com/questions/67830079/how-to-use-selenium-in-databricks-and-accessing-and-moving-downloaded-files-to-m – Chen Hirsh May 30 '23 at 05:55
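As a rough illustration of the point in the comments above, the Chrome binary has to be looked up on the cluster itself rather than on the local machine. The sketch below is only an assumption about how one might do that lookup: `find_chrome_binary` and `CHROME_CANDIDATES` are hypothetical names, and the executable names in the tuple are common defaults on Linux, not something the thread confirms for any particular cluster image.

```python
import shutil

# Executable names that Chrome/Chromium packages commonly install on Linux.
# This list is an assumption; adjust it to whatever is installed on the cluster.
CHROME_CANDIDATES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium-browser",
    "chromium",
)


def find_chrome_binary(candidates=CHROME_CANDIDATES):
    """Return the absolute path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


chrome_path = find_chrome_binary()
if chrome_path is None:
    print("No Chrome binary found on PATH; install Chrome on the cluster first.")
else:
    print(f"Chrome binary found at {chrome_path}")
```

If a path is found, it could be assigned to `options.binary_location`. When Chrome is installed in a standard location, chromedriver can usually locate it without that setting.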

2 Answers


Try the code below.

options = Options()
options.add_argument('--headless')
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")


Using a shell command, download chromedriver version 113 to /tmp/chromedriver_linux64.zip.

%sh
wget https://chromedriver.storage.googleapis.com/113.0.5672.63/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip

Unzip the file.


%sh
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver113/

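Install Chrome. Note that google-chrome-stable installs the current stable release, which is not necessarily version 113; the chromedriver major version must match the installed Chrome major version.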

%sh
curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
echo "deb https://dl.google.com/linux/chrome/deb/ stable main" | sudo tee -a /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
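Because the driver and the browser are installed in separate steps, it may be worth confirming that their major versions agree before starting a session. The following is a minimal sketch under stated assumptions: `major_version`, `versions_match`, and `read_version` are hypothetical helper names, and the parsing assumes the usual `--version` output of the form "Google Chrome 113.0.5672.63" and "ChromeDriver 113.0.5672.63 (...)".

```python
import re
import subprocess


def major_version(version_output):
    """Extract the major version from text such as 'Google Chrome 113.0.5672.63'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """Return True when Chrome and chromedriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


def read_version(executable):
    """Run '<executable> --version' and return its standard output."""
    completed = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    return completed.stdout
```

On the cluster this could be used as `versions_match(read_version("google-chrome"), read_version("/tmp/chromedriver113/chromedriver"))`, assuming both executables exist at those locations.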


Now get the data from the URL.

browser = webdriver.Chrome(service=Service('/tmp/chromedriver113/chromedriver'), options=options)
url = "https://data.cms.gov/tools/mapping-medicare-disparities-by-population"
browser.get(url)
browser.title


Follow this solution for more information.

JayashankarGS

Try the following:

%pip install selenium
%pip install webdriver_manager

from selenium import webdriver 
from selenium.webdriver import Chrome 
from selenium.webdriver.chrome.service import Service 
from selenium.webdriver.common.by import By 
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as ExpectedConditions
from selenium.webdriver import ChromeOptions

options = ChromeOptions()
options.add_argument('--headless')
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")

# Create a new Chrome webdriver instance
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

url = "https://data.cms.gov/tools/mapping-medicare-disparities-by-population"
driver.get(url)

driver.quit()

The issue with your code is that options.binary_location points to a Windows path (C:\Program Files\Google\Chrome\Application), which does not exist on the Databricks cluster, since its nodes run Linux.
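To make that diagnosis concrete, here is a small sketch of a check that could be run before handing a path to Selenium. The function name `binary_location_problem` and its messages are hypothetical, introduced only for illustration; the check relies on the standard library alone.

```python
import os
from pathlib import PureWindowsPath


def binary_location_problem(path):
    """Return a reason why `path` cannot be used as the Chrome binary, or None if it looks usable."""
    # A drive letter such as 'C:' marks a Windows path, which cannot exist on a Linux node.
    if PureWindowsPath(path).drive:
        return "Windows-style path; the cluster nodes run Linux"
    if not os.path.exists(path):
        return "path does not exist on this machine"
    # binary_location must name the executable itself, not its containing folder.
    if os.path.isdir(path):
        return "path is a directory, not the Chrome executable"
    if not os.access(path, os.X_OK):
        return "file is not executable"
    return None


print(binary_location_problem(r"C:\Program Files\Google\Chrome\Application"))
```

The path from the question is rejected at the first check. It also names a folder rather than the Chrome executable, so it would not work as binary_location even on Windows.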

  • I am getting an error on the line that creates the driver, driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options). Error: WebDriverException: Message: unknown error: cannot find Chrome binary – Abhishek Jain Jun 08 '23 at 08:46
  • The webdriver_manager library was supposed to get the binaries for you. Are you using it? Here you can find more details on the lib: https://pypi.org/project/webdriver-manager/ – Herivelton Andreassa Jun 08 '23 at 22:41