I'm trying to extract some data using Selenium for a pet project of mine. I've already loaded a few pages successfully and got their data; however, this one site stops loading every time I test it. Things I have tried:
- Using geckodriver with Firefox both headless & non-headless (headed??) versions
- Using chromedriver with Chrome both headless & non-headless versions
- Checked that pip3 & Selenium are all latest stable versions
- Opening Chrome with a user agent profile
- Opening Chrome with a random user agent profile (from random_user_agent library)
- Hardcoding waits for up to 30 seconds (time.sleep)
- Loading the page with requests (in hindsight this was silly, since the content I want is rendered by JavaScript; it didn't work)
My theory is that they're blocking Selenium somehow, maybe this? But I have no way to test it. The issue does not occur when I load the page without a Selenium browser instance (i.e. in a regular browser). My code is below:
from selenium import webdriver
# requirements to wait until specific part of page is open
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--lang=en_US")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
options.add_argument("disable-infobars")
browser = webdriver.Chrome(options=options)
delay = 5
browser.get("https://shop.coles.com.au/a/alexandria/product/nutella-spread-chocolate-hazelnut-2620684p")
# this is where the page is not loading, so the wait below times out or the element lookup fails
try:
    price_dollars = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CLASS_NAME, "price-dollars")))
    price_cents = browser.find_element(By.CLASS_NAME, "price-cents")
    # converts strings into floats and combines them into one dollars.cents value
    fl_price_dollars = float(price_dollars.text)
    fl_price_cents = float(price_cents.text)
    fl_price_concat = fl_price_dollars + fl_price_cents*10**-2
    print(type(fl_price_concat)) # check this is a float type not string
    print(fl_price_concat)
except TimeoutException:
    print("Timeout1")
except NoSuchElementException: # need to catch all exceptions & pass to quit() or processes will continue to run
    print("Element not found")
browser.quit()
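One way I could probe the "they're blocking Selenium" theory is to ask the browser what it reports in navigator.webdriver, which the W3C WebDriver spec says should be true when the browser is under automation. A minimal sketch, assuming a live Selenium driver; the helper name reports_webdriver_flag is made up for illustration, and a False result would not rule out other fingerprinting signals:

```python
def reports_webdriver_flag(driver):
    """Return True if page JavaScript can see navigator.webdriver set to true."""
    # execute_script runs the snippet in the current page and returns its value.
    # navigator.webdriver may be true, false, or undefined (which comes back as None).
    return bool(driver.execute_script("return navigator.webdriver"))
```

It would be called as print(reports_webdriver_flag(browser)) after browser.get(...) and before browser.quit().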
Page source that loads when I open the page using a Selenium browser instance:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<link rel="shortcut icon" href="about:blank">
</head>
<body>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/j.js"></script>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/f.js"></script>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint/script/kpf.js?url=/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint&token=c2e6cd9a-e76e-cd51-288d-f604aea52023"></script>
</body>
</html>
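To tell this stub page apart from the real product page in code, rather than just waiting for a timeout, something like the following might work. It is only a heuristic sketch based on the source above: it assumes the stub is recognisable by a script URL containing "fingerprint" and by the absence of the price-dollars class, and the names looks_like_challenge_page and _PageSummary are hypothetical.

```python
from html.parser import HTMLParser


class _PageSummary(HTMLParser):
    """Collects script src values and every class name seen in a document."""

    def __init__(self):
        super().__init__()
        self.script_srcs = []
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.script_srcs.append(attrs["src"])
        # A class attribute may be missing or valueless, hence the "or".
        self.classes.update((attrs.get("class") or "").split())


def looks_like_challenge_page(html, expected_class="price-dollars"):
    """Heuristic: a fingerprint script is present and the expected element is not."""
    summary = _PageSummary()
    summary.feed(html)
    has_fingerprint_script = any("fingerprint" in src for src in summary.script_srcs)
    return has_fingerprint_script and expected_class not in summary.classes
```

It could be called as looks_like_challenge_page(browser.page_source) right after browser.get(...), so the script can report "blocked" instead of a generic timeout.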
EDIT: the following answer worked for me in June 2020.