
I have a question about --headless mode in Python Selenium for Chrome.

Code

 from selenium import webdriver
 from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

 CHROME_DRIVER_DIR = "selenium/chromedriver"

 chrome_options = webdriver.ChromeOptions()
 caps = DesiredCapabilities().CHROME
 chrome_options.add_argument("--disable-dev-shm-usage")
 chrome_options.add_argument("--remote-debugging-port=9222")
 chrome_options.add_argument("--headless")  # Runs Chrome in headless mode.
 chrome_options.add_argument('--no-sandbox')  # Bypass OS security model
 chrome_options.add_argument("--disable-extensions")
 chrome_options.add_argument("--disable-gpu")

 browser = webdriver.Chrome(desired_capabilities=caps, executable_path=CHROME_DRIVER_DIR, options=chrome_options)

 browser.get("https://www.manta.com/c/mm2956g/mashuda-contractors")
 print(browser.page_source)
 browser.quit()

When I remove chrome_options.add_argument("--headless") everything works fine, but with --headless I get the following response:

Please enable cookies.

Error 1020 Ray ID: 53fd62b4087d8116 • 2019-12-04 11:19:28 UTC

Access denied

What happened?
This website is using a security service to protect itself from online attacks.

Cloudflare Ray ID: 53fd62b4087d8116 • Your IP: 168.81.117.111 • Performance & security by Cloudflare

What is the difference between normal mode and --headless?

undetected Selenium

4 Answers


It's the HTTP User-Agent header that Cloudflare doesn't like.

To get around this issue, simply change your user-agent chrome option (below code is for Selenium in Python):

option.add_argument('--headless')
option.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36")
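
As a minimal sketch of the same idea: at the time of these answers, headless Chrome reported `HeadlessChrome/<version>` in its default User-Agent, so instead of hard-coding a version one could derive the override from the browser's own string. The helper names `strip_headless_marker` and `headless_arguments` are hypothetical, and the sample string is for illustration only.

```python
def strip_headless_marker(user_agent: str) -> str:
    """Return the User-Agent with 'HeadlessChrome/' replaced by 'Chrome/'."""
    return user_agent.replace("HeadlessChrome/", "Chrome/")


def headless_arguments(user_agent: str) -> list:
    """Build the Chrome arguments for a headless run with a cleaned User-Agent."""
    return ["--headless", "user-agent=" + strip_headless_marker(user_agent)]


# Sample value for illustration only; in practice it could be read from the
# browser, e.g. driver.execute_script("return navigator.userAgent").
sample_ua = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/84.0.4147.125 Safari/537.36"
)
arguments = headless_arguments(sample_ua)
```

Each entry of `arguments` would then be passed to `option.add_argument(...)` before the driver is created.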
OwN

I tested using this server-side script:

<?php
echo "<pre><code>";
var_dump($_SERVER);
echo "</code></pre>";
?>
<script>
    var el = document.getElementsByTagName('code')[0];
    for(var prop in window.navigator){
        var str = JSON.stringify(window.navigator[prop])
        el.innerHTML = el.innerHTML + "window.navigator." + prop + " = " + str + "\n";
    }
    var skip_props = ['parent', 'top', 'frames', 'self', 'window'];
    for(var prop in window){
        if (skip_props.indexOf(prop) > -1) { continue; }
        el.innerHTML = el.innerHTML + "window." + prop + " = ";
        var str = JSON.stringify(window[prop])
        el.innerHTML = el.innerHTML + str + "\n";
    }
</script>

I loaded this page using ChromeDriver, with and without using --headless, and printed the output using print(driver.find_element_by_tag_name('code').text). I then diff-ed both outputs.
Here are the differences I found (each shown as normal mode vs --headless):

  • HTTP Accept-Language header: en-US,en;q=0.9 vs en-US
  • HTTP User-Agent header: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 vs Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/83.0.4103.61 Safari/537.36 (Note the HeadlessChrome mention in the second string.)
  • JavaScript window.navigator.plugins: {"0":{"0":{}},"1":{"0":{}},"2":{"0":{},"1":{}}} vs {}
  • JavaScript window.navigator.mimeTypes: {"0":{},"1":{},"2":{},"3":{}} vs {}
  • JavaScript window.outerWidth: 1367 vs 0
  • JavaScript window.outerHeight: 641 vs 0
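
The diff step itself can be sketched with Python's standard `difflib`. This is only an illustration of the comparison, not the exact tooling used for the test: the function name `diff_dumps` is hypothetical, the two dumps are shortened to a few of the values listed above, and the `navigator.language` line is a made-up unchanged entry included to show that identical lines are dropped.

```python
import difflib


def diff_dumps(normal_text: str, headless_text: str) -> list:
    """Return only the lines that differ between the two dumps."""
    lines = list(
        difflib.unified_diff(
            normal_text.splitlines(),
            headless_text.splitlines(),
            fromfile="normal",
            tofile="headless",
            lineterm="",
            n=0,  # no context lines, only changes
        )
    )
    # The first two lines are the "---"/"+++" file headers; skip them,
    # then drop the "@@" hunk markers.
    return [line for line in lines[2:] if line.startswith(("+", "-"))]


normal_dump = "\n".join([
    'window.navigator.mimeTypes = {"0":{},"1":{},"2":{},"3":{}}',
    'window.navigator.language = "en-US"',  # hypothetical unchanged line
    "window.outerWidth = 1367",
    "window.outerHeight = 641",
])
headless_dump = "\n".join([
    "window.navigator.mimeTypes = {}",
    'window.navigator.language = "en-US"',  # hypothetical unchanged line
    "window.outerWidth = 0",
    "window.outerHeight = 0",
])

differences = diff_dumps(normal_dump, headless_dump)
for line in differences:
    print(line)
```

Lines prefixed with `-` come from the normal run and lines prefixed with `+` from the headless run.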

Of note: the Python script you posted is missing a few lines that mask the navigator.webdriver property (without them, it is trivial for the server to detect that you are using WebDriver) [ref]:

driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
    Object.defineProperty(navigator, 'webdriver', {
      get: () => undefined
    })
  """
})
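
Because `Page.addScriptToEvaluateOnNewDocument` only affects documents created after it is registered, the call has to happen before the first `driver.get(...)`. A small wrapper can make that ordering explicit. This is a sketch: the name `hide_webdriver_flag` is hypothetical, and `driver` is assumed to be a Chrome WebDriver exposing `execute_cdp_cmd`, as in the snippet above.

```python
HIDE_WEBDRIVER_JS = """
Object.defineProperty(navigator, 'webdriver', {
  get: () => undefined
})
"""


def hide_webdriver_flag(driver):
    """Register the masking script for every document loaded from now on."""
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": HIDE_WEBDRIVER_JS},
    )


# Intended order (needs Selenium and ChromeDriver, so left commented out):
# driver = webdriver.Chrome(options=options)
# hide_webdriver_flag(driver)   # before the first navigation
# driver.get("https://www.manta.com/c/mm2956g/mashuda-contractors")
```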
Guillaume Boudreau

I took your code, removed the optional arguments and added a few arguments to execute the test as follows:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    options = webdriver.ChromeOptions() 
    options.add_argument("start-maximized")
    options.add_argument("--headless")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("https://www.manta.com/c/mm2956g/mashuda-contractors")
    print(driver.page_source)
    driver.quit()
    
  • Console Output:

    <html class="js" lang="en-US" style="opacity: 1; visibility: visible;"><!--<![endif]--><head>
    <title>Access denied | www.manta.com used Cloudflare to restrict access</title>
    <meta charset="UTF-8">
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1">
    <meta name="robots" content="noindex, nofollow">
    <meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1">
    <link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css" media="screen,projection">
    <!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->
    <style type="text/css">body{margin:0;padding:0}</style>
    
    
    <!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/zepto.min.js"></script><!--<![endif]-->
    <!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/cf.common.js"></script><!--<![endif]-->
    
    
    
    </head>
    <body>
      <div id="cf-wrapper">
        <div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>
        <div id="cf-error-details" class="cf-error-details-wrapper">
          <div class="cf-wrapper cf-header cf-error-overview">
        <h1>
          <span class="cf-error-type" data-translate="error">Error</span>
          <span class="cf-error-code">1020</span>
          <small class="heading-ray-id">Ray ID: 53fd7c2fca12d5fc • 2019-12-04 11:36:52 UTC</small>
        </h1>
        <h2 class="cf-subheadline">Access denied</h2>
          </div><!-- /.header -->
    
          <section></section><!-- spacer -->
    
          <div class="cf-section cf-wrapper">
        <div class="cf-columns two">
          <div class="cf-column">
            <h2 data-translate="what_happened">What happened?</h2>
            <p>This website is using a security service to protect itself from online attacks.</p>
          </div>
    
    
        </div>
          </div><!-- /.section -->
    
          <div class="cf-error-footer cf-wrapper">
      <p>
        <span class="cf-footer-item">Cloudflare Ray ID: <strong>53fd7c2fca12d5fc</strong></span>
        <span class="cf-footer-separator">•</span>
        <span class="cf-footer-item"><span>Your IP</span>: 123.201.54.43</span>
        <span class="cf-footer-separator">•</span>
        <span class="cf-footer-item"><span>Performance &amp; security by</span> <a href="https://www.cloudflare.com/5xx-error-landing?utm_source=error_footer" id="brand_link" target="_blank">Cloudflare</a></span>
    
      </p>
    </div><!-- /.error-footer -->
    
    
        </div><!-- /#cf-error-details -->
      </div><!-- /#cf-wrapper -->
    
      <script type="text/javascript">
      window._cf_translation = {};
    
    
    </script>
    
    
    
    </body></html>
    

Analysis

The extracted page source makes it clear that, with the --headless argument, you land on a page with:

  • Title: Access denied | www.manta.com used Cloudflare to restrict access.
  • Some information: What happened?: This website is using a security service to protect itself from online attacks.

Conclusion

The browsing context, i.e. the Chrome browser session, is being detected as a bot, and the navigation is blocked.
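
To check for this outcome in a script rather than by reading the HTML, one option is to look for the error-code element that appears in the console output above. This is a minimal sketch: the function name `cloudflare_error_code` is hypothetical, and the markup it matches is taken from that single response, so it may not hold for other Cloudflare pages.

```python
import re

# Pattern based on the blocked page shown in the console output above.
_ERROR_CODE_RE = re.compile(r'<span class="cf-error-code">(\d+)</span>')


def cloudflare_error_code(page_source: str):
    """Return the Cloudflare error code found in the page, or None."""
    match = _ERROR_CODE_RE.search(page_source)
    return int(match.group(1)) if match else None


# Fragment copied from the console output above.
blocked_fragment = (
    '<span class="cf-error-type" data-translate="error">Error</span>\n'
    '<span class="cf-error-code">1020</span>'
)
code = cloudflare_error_code(blocked_fragment)
```

In the script above, this would be called as `cloudflare_error_code(driver.page_source)` right after `driver.get(...)`.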


Outro

You can find a couple of relevant discussions in:

undetected Selenium
  • Without `options.add_argument("--headless")` everything works fine on my end, so why does `--headless` trigger the `Cloudflare` protection alert? The code is otherwise identical; only that one argument differs. Is there any way to avoid it, so that `--headless` behaves the same as normal mode? – Максим Дихтярь Dec 04 '19 at 11:53
  • @МаксимДихтярь Check out the updated answer and let me know the status. – undetected Selenium Dec 04 '19 at 11:59
  • You didn't answer the question, *why* is it blocked? – Guy Dec 04 '19 at 12:14
  • @Guy Perhaps you need to revisit the answer, specifically the **Analysis** and **Conclusion** sections. – undetected Selenium Dec 04 '19 at 12:21
  • @DebanjanB I did. The analysis section is in the question, just not in HTML format. The conclusion is correct, but doesn't explain why it works without `--headless`. – Guy Dec 04 '19 at 12:26
  • @Guy You still need to read the discussions in detail, which I have provided in the **Outro** section, for a better understanding. – undetected Selenium Dec 04 '19 at 12:28

Cloudflare aims to block bots. It assumes headless browsers are used by data scrapers, so it blocks them. From Cloudflare's "What is Data Scraping?":

A headless browser is a type of web browser, much like Chrome or Firefox, but it doesn’t have a visual user interface by default, allowing it to move much faster than a typical web browser. By essentially running at the level of a command line, a headless browser is able to avoid rendering entire web applications. Data scrapers write bots that use headless browsers to request data more quickly, as there is no human viewing each page being scraped.

Guy
  • Blocking of bots is not restricted to Cloudflare. It's a common practice among numerous anti-scraping services. Besides Cloudflare, there are Distil, Akamai, etc. – undetected Selenium Dec 04 '19 at 12:47
  • @DebanjanB Did I say it's unique to Cloudflare? The OP asked about a specific site protected by Cloudflare. – Guy Dec 04 '19 at 12:49