I'm trying to extract some data using Selenium for a pet project of mine. I've already loaded a few pages successfully and got their data, but this one site stops loading every time I test it. Things I have tried:

  • Using geckodriver with Firefox, both headless and non-headless (headed)
  • Using chromedriver with Chrome, both headless and non-headless
  • Checking that pip3 and Selenium are the latest stable versions
  • Opening Chrome with a user agent profile
  • Opening Chrome with a random user agent profile (from the random_user_agent library)
  • Hardcoding waits of up to 30 seconds (time.sleep)
  • Loading the page with requests (didn't work; in hindsight this was silly, since the content is rendered by JavaScript)

The URL

My theory is that they're blocking Selenium somehow, maybe this? But I have no way to test it. The issue does not occur when I load the page in a regular browser instead of a Selenium browser instance. My code is below:

from selenium import webdriver

# requirements to wait until specific part of page is open
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--lang=en_US")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
options.add_argument("disable-infobars")

browser = webdriver.Chrome(options=options)        
delay = 5
browser.get("https://shop.coles.com.au/a/alexandria/product/nutella-spread-chocolate-hazelnut-2620684p")
# this is where the page does not load, so the element lookups below fail

try:
    price_dollars = WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CLASS_NAME, "price-dollars"))
    )
    # find_element(By.CLASS_NAME, ...) replaces the deprecated find_element_by_class_name
    price_cents = browser.find_element(By.CLASS_NAME, "price-cents")

    # converts the dollar and cent strings into a single float
    fl_price_dollars = float(price_dollars.text)
    fl_price_cents = float(price_cents.text)
    fl_price_concat = fl_price_dollars + fl_price_cents / 100
    print(type(fl_price_concat))  # check this is a float type, not a string
    print(fl_price_concat)
except TimeoutException:
    print("Timeout1")
except NoSuchElementException:
    print("Element not found")
finally:
    # always quit, or the browser and driver processes will continue to run
    browser.quit()

This is the page source that loads when I open the page in a Selenium browser instance:


<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <link rel="shortcut icon" href="about:blank">
</head>
<body>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/j.js"></script>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/f.js"></script>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint/script/kpf.js?url=/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint&token=c2e6cd9a-e76e-cd51-288d-f604aea52023"></script>
</body>
</html>

EDIT: The following answer worked for me in June 2020.

Heartthrob_Rob
  • Looks like they're using [FingerPrint2](https://github.com/fingerprintjs/fingerprintjs2), which is blocking you. Even with JS disabled, there seem to be other WAF mechanisms in place. They're actively blocking you from scraping their website. – Lucan Jun 26 '20 at 09:38
  • @Lucan thanks for that, how did you determine it was FingerPrint2? Also, do you know how to circumvent this? – Heartthrob_Rob Jun 26 '20 at 09:48
  • It's in their sources; it's easier to spot when you're looking at the blocked page. I tried the basics to get around it, much like yourself (UA, Proxy, Options), but I had no success. – Lucan Jun 26 '20 at 10:06
  • Looks like this will be a longer road than anticipated... thank you for the heads up, much appreciated! – Heartthrob_Rob Jun 26 '20 at 10:13

1 Answer

Try adding this argument: options.add_argument("--disable-blink-features=AutomationControlled")

The key is to make navigator.webdriver return undefined. It returns true when Chrome is controlled by WebDriver (which Selenium uses).

With this argument added, evaluating navigator.webdriver (you can test it in the dev tools console) returns undefined, the same as in a regular Chrome.

Dharman
matteo84