27

Hoping an expert can help me with a Selenium/Cloudflare mystery. I can get a website to load in normal (non-headless) Selenium, but no matter what I try, I can't get it to load in headless.

I have followed the suggestions from Stack Overflow posts like "Is there a version of Selenium WebDriver that is not detectable?". I've also compared all the properties of the window and window.navigator objects and fixed every difference between headless and non-headless, but somehow headless is still being detected. At this point I am extremely curious how Cloudflare could possibly figure out the difference. Thank you for your time!
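
One way to make the headless versus non-headless comparison described above systematic is to dump the same set of properties in both modes and diff the results. This is only a minimal sketch: `COLLECT_JS`, `snapshot`, and `diff_snapshots` are hypothetical names (not part of Selenium), and the property list is an illustration rather than a complete inventory.

```python
import json

# JavaScript that serializes a handful of properties.
# The property list is illustrative, not exhaustive.
COLLECT_JS = """
return JSON.stringify({
    userAgent: navigator.userAgent,
    webdriver: String(navigator.webdriver),
    languages: navigator.languages,
    pluginCount: navigator.plugins.length,
    mimeTypeCount: navigator.mimeTypes.length,
    outerWidth: window.outerWidth,
    outerHeight: window.outerHeight,
    hasChrome: typeof window.chrome !== 'undefined'
});
"""


def snapshot(driver):
    """Run COLLECT_JS in the current page and return the result as a dict."""
    return json.loads(driver.execute_script(COLLECT_JS))


def diff_snapshots(headed, headless):
    """Return {key: (headed_value, headless_value)} for every key that differs."""
    keys = set(headed) | set(headless)
    return {
        key: (headed.get(key), headless.get(key))
        for key in sorted(keys)
        if headed.get(key) != headless.get(key)
    }
```

Calling `snapshot(browser)` once with `options.headless = False` and once with `True`, then passing both results to `diff_snapshots`, lists whatever still differs. Properties visible from JavaScript are only part of what a server can observe, so an empty diff does not prove the two modes are indistinguishable.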

List of the things I have tried:

  • User-agent
  • Replace cdc_ with another string in chromedriver
  • options.add_experimental_option("excludeSwitches", ["enable-automation"])
  • options.add_experimental_option('useAutomationExtension', False)
  • options.add_argument('--disable-blink-features=AutomationControlled') (this was necessary to get website to load in non-headless)
  • Set navigator.webdriver = undefined
  • Set navigator.plugins, navigator.languages, and navigator.mimeTypes
  • Set window.screenY, window.screenTop, window.outerWidth, window.outerHeight to be nonzero
  • Set window.chrome and window.navigator.chrome
  • Set width and height of images to be nonzero
  • Set WebGL parameters
  • Fix Modernizr

Replicating the experiment

In order to get the website to load in normal (non-headless) Selenium, you have to follow a _blank link from another website (so that the target website opens in another tab). To replicate the experiment, first create an html file with the content <a href="https://poocoin.app" target="_blank">link</a>, and then paste the path to this html file in the following code.

The version below (non-headless) runs fine and loads the website, but if you set options.headless = True, it will get stuck on Cloudflare.

from selenium import webdriver
import time

# Replace this with the path to your html file
FULL_PATH_TO_HTML_FILE = 'file:///Users/simplepineapple/html/url_page.html'

def visit_website(browser):
    browser.get(FULL_PATH_TO_HTML_FILE)
    time.sleep(3)

    # find_elements_by_xpath was removed in Selenium 4; "xpath" is the value of By.XPATH
    links = browser.find_elements("xpath", "//a[@href]")
    links[0].click()
    time.sleep(10)

    # Switch webdriver focus to new tab so that we can extract html
    tab_names = browser.window_handles
    if len(tab_names) > 1:
        browser.switch_to.window(tab_names[1])

    time.sleep(1)
    html = browser.page_source
    print(html)
    print()
    print()

    if 'Charts' in html:
        print('Success')
    else:
        print('Fail')

    time.sleep(10)


options = webdriver.ChromeOptions()
# If options.headless = True, the website will not load
options.headless = False
options.add_argument("--window-size=1920,1080")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36')

browser = webdriver.Chrome(options = options)

browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    "source": '''
    Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined
    });
    Object.defineProperty(navigator, 'plugins', {
        get: function() { return {"0":{"0":{}},"1":{"0":{}},"2":{"0":{},"1":{}}}; }
    });
    Object.defineProperty(navigator, 'languages', {
        get: () => ["en-US", "en"]
    });
    Object.defineProperty(navigator, 'mimeTypes', {
        get: function() { return {"0":{},"1":{},"2":{},"3":{}}; }
    });

    window.screenY=23;
    window.screenTop=23;
    window.outerWidth=1337;
    window.outerHeight=825;
    window.chrome =
    {
      app: {
        isInstalled: false,
      },
      webstore: {
        onInstallStageChanged: {},
        onDownloadProgress: {},
      },
      runtime: {
        PlatformOs: {
          MAC: 'mac',
          WIN: 'win',
          ANDROID: 'android',
          CROS: 'cros',
          LINUX: 'linux',
          OPENBSD: 'openbsd',
        },
        PlatformArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        PlatformNaclArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        RequestUpdateCheckStatus: {
          THROTTLED: 'throttled',
          NO_UPDATE: 'no_update',
          UPDATE_AVAILABLE: 'update_available',
        },
        OnInstalledReason: {
          INSTALL: 'install',
          UPDATE: 'update',
          CHROME_UPDATE: 'chrome_update',
          SHARED_MODULE_UPDATE: 'shared_module_update',
        },
        OnRestartRequiredReason: {
          APP_UPDATE: 'app_update',
          OS_UPDATE: 'os_update',
          PERIODIC: 'periodic',
        },
      },
    };
    window.navigator.chrome =
    {
      app: {
        isInstalled: false,
      },
      webstore: {
        onInstallStageChanged: {},
        onDownloadProgress: {},
      },
      runtime: {
        PlatformOs: {
          MAC: 'mac',
          WIN: 'win',
          ANDROID: 'android',
          CROS: 'cros',
          LINUX: 'linux',
          OPENBSD: 'openbsd',
        },
        PlatformArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        PlatformNaclArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        RequestUpdateCheckStatus: {
          THROTTLED: 'throttled',
          NO_UPDATE: 'no_update',
          UPDATE_AVAILABLE: 'update_available',
        },
        OnInstalledReason: {
          INSTALL: 'install',
          UPDATE: 'update',
          CHROME_UPDATE: 'chrome_update',
          SHARED_MODULE_UPDATE: 'shared_module_update',
        },
        OnRestartRequiredReason: {
          APP_UPDATE: 'app_update',
          OS_UPDATE: 'os_update',
          PERIODIC: 'periodic',
        },
      },
    };
    ['height', 'width'].forEach(property => {
        const imageDescriptor = Object.getOwnPropertyDescriptor(HTMLImageElement.prototype, property);

        // redefine the property with a patched descriptor
        Object.defineProperty(HTMLImageElement.prototype, property, {
            ...imageDescriptor,
            get: function() {
                // return an arbitrary non-zero dimension if the image failed to load
                if (this.complete && this.naturalHeight == 0) {
                    return 20;
                }
                return imageDescriptor.get.apply(this);
            },
        });
    });

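    // Keep a reference to the original method, which lives on the prototype
    const getParameter = WebGLRenderingContext.prototype.getParameter;
    WebGLRenderingContext.prototype.getParameter = function(parameter) {
        // 37445 = UNMASKED_VENDOR_WEBGL
        if (parameter === 37445) {
            return 'Intel Open Source Technology Center';
        }
        // 37446 = UNMASKED_RENDERER_WEBGL
        if (parameter === 37446) {
            return 'Mesa DRI Intel(R) Ivybridge Mobile ';
        }

        // Call the original with the correct receiver
        return getParameter.call(this, parameter);
    };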

    const elementDescriptor = Object.getOwnPropertyDescriptor(HTMLElement.prototype, 'offsetHeight');

    Object.defineProperty(HTMLDivElement.prototype, 'offsetHeight', {
        ...elementDescriptor,
        get: function() {
            if (this.id === 'modernizr') {
                return 1;
            }
            return elementDescriptor.get.apply(this);
        },
    });
    '''
})

visit_website(browser)

browser.quit()
undetected Selenium
simplepineapple
  • Are you talking about "I'm under attack mode"? That will run some JS tests that you won't be able to spoof (timing drawing things on canvas maybe?). – pguardiario Jul 08 '21 at 02:34
  • Thank you for the detailed description of how to make things work in non-headless mode. I have reproduced your experiment and get exactly the same behaviour. I don't have an answer to your question, but perhaps you, like me, can use a virtual framebuffer device to simulate non-headless mode. For me Xvnc worked; I used it because I want to be able to observe the process anyway. Perhaps you can get away with the more lightweight Xvfb. – abb Oct 29 '21 at 11:32
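
The virtual framebuffer idea from the comment above can be sketched as follows. This assumes a Linux machine with the `xvfb-run` wrapper installed; `build_xvfb_command`, `run_under_xvfb`, and `scrape.py` are hypothetical names used only for illustration.

```python
import subprocess
import sys


def build_xvfb_command(script, width=1920, height=1080, depth=24):
    """Build an argv list that runs `script` under a virtual X display.

    -a picks a free display number; --server-args passes the screen
    geometry through to the Xvfb server.
    """
    screen = f"-screen 0 {width}x{height}x{depth}"
    return ["xvfb-run", "-a", f"--server-args={screen}", sys.executable, script]


def run_under_xvfb(script):
    """Run the Selenium script with Chrome drawing into the virtual display.

    The script itself keeps options.headless = False.
    """
    return subprocess.run(build_xvfb_command(script), check=True)
```

For example, `run_under_xvfb("scrape.py")` would start the non-headless script from the question on a machine without a physical screen.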

5 Answers

28

Using the latest Google Chrome v96.0, if you retrieve the user agent:

  • For the regular (non-headless) browser, the following is in use:

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36
    
  • Whereas for the headless browser, the following is in use:

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/96.0.4664.110 Safari/537.36
    

In the majority of cases, the presence of the additional Headless string in the user agent gets the client identified as a bot, and access to the website is blocked.
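
The user-agent difference shown above can be checked and neutralized with plain string handling. A minimal sketch; `looks_headless` and `normalize_user_agent` are hypothetical helper names, not part of Selenium:

```python
def looks_headless(user_agent):
    """Return True if the user-agent string advertises headless Chrome."""
    return "HeadlessChrome" in user_agent


def normalize_user_agent(user_agent):
    """Replace the HeadlessChrome token with the regular Chrome token."""
    return user_agent.replace("HeadlessChrome", "Chrome")
```

The normalized string can then be passed to Chrome with `options.add_argument(f"user-agent={normalize_user_agent(ua)}")`, the same flag the question uses. The question already overrides the user agent and is still blocked, so this string is at most one signal among several.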


Solution

There are different approaches to evade Cloudflare detection even when using Chrome in headless mode, and some of the efficient approaches are as follows:

  • An efficient solution would be to use the undetected-chromedriver to initialize the Chrome Browsing Context. undetected-chromedriver is an optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.

    • Code Block:

      import undetected_chromedriver as uc
      from selenium import webdriver
      
      options = webdriver.ChromeOptions() 
      options.headless = True
      options.add_argument("start-maximized")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = uc.Chrome(options=options)
      driver.get('https://bet365.com')
      


  • The most efficient solution would be to use Selenium Stealth to initialize the Chrome Browsing Context. selenium-stealth is a python package to prevent detection. This programme tries to make python selenium more stealthy.

    • Code Block:

      from selenium import webdriver
      from selenium_stealth import stealth
      
      options = webdriver.ChromeOptions()
      options.add_argument("start-maximized")
      options.add_argument("--headless")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = webdriver.Chrome(options=options, executable_path=r"C:\path\to\chromedriver.exe")
      
      stealth(driver,
              languages=["en-US", "en"],
              vendor="Google Inc.",
              platform="Win32",
              webgl_vendor="Intel Inc.",
              renderer="Intel Iris OpenGL Engine",
              fix_hairline=True,
              )
      
      driver.get("https://bot.sannysoft.com/")
      


undetected Selenium
  • Thank you, seems Cloudflare was detecting headless chrome and flagging the site in my case, have since changed the user-agent, though would have preferred to use the default one – Richard Muvirimi Apr 04 '22 at 09:39
2

@undetected Selenium's answer works perfectly with https://github.com/diprajpatra/selenium-stealth

If you are using the latest version of Selenium, you will need to replace the executable_path parameter because it is deprecated. Example code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

s=Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s, options=options)

stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
)

driver.get("https://bot.sannysoft.com/")

print(driver.find_element(By.XPATH, "/html/body").text)

# quit() ends the whole session; close() would only close the current window
driver.quit()
Den Pat
1

The only thing I can suggest in addition is to improve your spoofed navigator plugins and MIME types, because detection scripts sometimes check their types (for example, that navigator.plugins is a PluginArray):

Object.defineProperty(navigator, 'plugins', {
    get: () => {
        var ChromiumPDFPlugin = {};
        var plugin = {
            ChromiumPDFPlugin,
            description: 'Portable Document Format',
            filename: 'internal-pdf-viewer',
            length: 1,
            name: 'Chromium PDF Plugin',

        };
        plugin.__proto__ = Plugin.prototype;

        var plugins = {
            0: plugin,
            length: 1
        };
        plugins.__proto__ = PluginArray.prototype;
        return plugins;
    },
});

Object.defineProperty(navigator, 'mimeTypes', {
    get: () => {
        var mimeType = {
            type: 'application/pdf',
            suffixes: 'pdf',
            description: 'Portable Document Format',
            enabledPlugin: Plugin

        };
        mimeType.__proto__ = MimeType.prototype;

        var mimeTypes = {
            0: mimeType,
            length: 1
        };
        mimeTypes.__proto__ = MimeTypeArray.prototype;
        return mimeTypes;
    },
});

A good website for checking what is going wrong in headless mode is https://bot.sannysoft.com/

You can run in headless mode and take a snapshot of the page to check whether all the tests passed.
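
Taking that snapshot can be sketched with Selenium's `save_screenshot`. `snapshot_check` and the output file name are hypothetical, and the fixed wait is an assumption about how long the page's tests take to finish:

```python
import time


def snapshot_check(driver, path="sannysoft.png",
                   url="https://bot.sannysoft.com/", wait_seconds=5):
    """Load the test page and save a screenshot for manual inspection.

    Returns whatever driver.save_screenshot returns (True on success).
    """
    driver.get(url)
    time.sleep(wait_seconds)  # give the page's tests time to finish
    return driver.save_screenshot(path)
```

In headless mode the screenshot is the only way to see the page, so opening the saved image shows which rows of the test table are marked as failed.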

P.S. Also, sometimes, even if navigator.webdriver is set to undefined, navigator still contains the webdriver property. You can simply remove it using the code below:

const newProto = navigator.__proto__;
delete newProto.webdriver;
navigator.__proto__ = newProto;
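
A snippet like this only helps if it runs before the page's own scripts. A minimal sketch of registering it from Python, assuming the same `execute_cdp_cmd` call used in the question (Chromium-based drivers only); `REMOVE_WEBDRIVER_JS` and `install_init_script` are hypothetical names:

```python
# The JavaScript from the answer above, kept verbatim.
REMOVE_WEBDRIVER_JS = """
const newProto = navigator.__proto__;
delete newProto.webdriver;
navigator.__proto__ = newProto;
"""


def install_init_script(driver, source=REMOVE_WEBDRIVER_JS):
    """Register `source` to be evaluated on every new document.

    Relies on the Chrome DevTools Protocol command
    Page.addScriptToEvaluateOnNewDocument.
    """
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": source}
    )
```

Call `install_init_script(driver)` once, before the first `driver.get(...)`, so that the script is already registered when the target page loads.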
Nikita
0

Cloudflare's IUAM (I'm Under Attack Mode) protection is used primarily to prevent DDoS attacks, and as a consequence it also protects sites from automated bots. So no matter what you change on the client side, the Cloudflare server is still fingerprinting you. Once the check passes, it sends the client a cf_clearance cookie that lets you connect for the next 15 minutes.
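
To see whether a session actually received the clearance cookie, and how long it remains valid, the cookies returned by Selenium's `driver.get_cookies()` can be inspected. A minimal sketch; `find_clearance` and `seconds_left` are hypothetical helpers, and the code assumes cookie dicts with a `name` key and an optional `expiry` key in epoch seconds:

```python
import time


def find_clearance(cookies):
    """Return the cf_clearance cookie dict from driver.get_cookies(), or None."""
    for cookie in cookies:
        if cookie.get("name") == "cf_clearance":
            return cookie
    return None


def seconds_left(cookie, now=None):
    """Seconds until the cookie expires; None if it carries no expiry."""
    expiry = cookie.get("expiry")
    if expiry is None:
        return None
    now = time.time() if now is None else now
    return max(0, int(expiry - now))
```

If `find_clearance(driver.get_cookies())` returns `None` after the page has loaded, the challenge was not passed in that session.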


Franz Kurt
  • I noticed the cf_clearance cookie is used to bypass the CAPTCHA once validated but even if I reuse this cookie in my WebDriver script, it is still asking me to complete the CAPTCHA while it is still a valid cookie in Firefox without WebDriver. The user agent is the same, so they are checking something else, maybe navigator.webdriver JavaScript variable? – baptx Jul 22 '23 at 11:55
-1

pip install undetected-chromedriver

You can use this module

  • The top-voted answer from a year and a half before this answer already suggests installing `undetected-chromedriver`. Please don't repeat answers. – ChrisGPT was on strike Aug 28 '23 at 17:49