I'm trying to scrape this page: https://www.g2.com/products/dropbox/reviews. However, I get detected as soon as the request is made. Is there a way around that?

I tried Requests before that and was also detected. I can't use Scrapy in this project, and I can't find proper information online on how to solve this.

Maybe I need to add custom headers?

The output of the code right now is the title of the page that tells you that you have been detected:

Pardon Our Interruption

Code:

from selenium import webdriver

def fetch(URL):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--disable-infobars')
    options.add_argument('--disable-extensions')
    options.add_argument('--profile-directory=Default')
    options.add_argument('--incognito')
    options.add_argument('--disable-plugins-discovery')
    options.add_argument('--start-maximized')
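    # options= replaces the deprecated chrome_options= keyword
    driver = webdriver.Chrome('chromedriver', options=options)
    driver.get(URL)

    print(driver.title)
    driver.quit()  # close the browser so repeated runs do not leave processes behind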


fetch('https://www.g2.com/products/dropbox/reviews')

EDIT: I was able to partly work around it and fetch a single page, but on the second run I get detected again. Code:

def fetch(URL):

    firefox_profile = webdriver.FirefoxProfile()
    firefox_profile.set_preference("browser.privatebrowsing.autostart", True)
    browser = webdriver.Firefox(executable_path='geckodriver.exe', firefox_profile=firefox_profile)
    browser.get(URL)
    print(browser.title)

fetch('https://www.g2.com/products/dropbox/reviews')
undetected Selenium
Slava Bugz
  • You can look [here](https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver) – Mert Köklü Dec 10 '19 at 10:51
  • Well, I was able to get a single page right now, but on the second run I get detected. Proxy rotation could probably help. – Slava Bugz Dec 10 '19 at 11:14
  • Sometimes editing the user-agent string to something more "normal" works. Selenium's user agent is kind of weird. Though it's pretty clear this site is trying to stop the exact activity you are trying to perform, hahaha – Rosey Dec 10 '19 at 15:22
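
A minimal sketch of the user-agent suggestion in the last comment, assuming a Selenium version that accepts `options=` and a chromedriver on PATH. The helper names `build_chrome_arguments` and `make_driver` and the example user-agent string are illustrative and not from the thread; `--user-agent` is a standard Chrome switch. Changing the user agent alone may well not be enough for this site.

```python
# Example value only (assumption): any realistic desktop browser string could be used.
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
)


def build_chrome_arguments(user_agent, headless=True):
    """Return Chrome command-line switches that set a custom User-Agent."""
    arguments = [f"--user-agent={user_agent}", "--window-size=1200,600"]
    if headless:
        arguments.append("--headless")
    return arguments


def make_driver(user_agent=EXAMPLE_USER_AGENT):
    """Hypothetical helper: start Chrome with the switches built above."""
    # Imported here so build_chrome_arguments can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in build_chrome_arguments(user_agent):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Calling `make_driver()` and then `driver.get(URL)` would send the overridden user agent with the request; printing `driver.title` afterwards shows whether the block page was still served.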

1 Answer

I took your code, made a few tweaks, and executed the script with the ChromeDriver / Chrome combination. I encountered the same issue, i.e. the page titled Pardon Our Interruption, as follows:

  • Code Block:

    from selenium import webdriver
    
    options = webdriver.ChromeOptions() 
    options.add_argument('--window-size=1200,600')  # Chrome expects --window-size=WIDTH,HEIGHT
    options.add_argument('--headless')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("https://www.g2.com/products/dropbox/reviews")
    print(driver.page_source)
    driver.quit()
    
  • Console Output:

    <html lang="zxx"><head>
        <title>Pardon Our Interruption</title>
        <link rel="stylesheet" type="text/css" href="//cdn.distilnetworks.com/css/distil.css" media="all">
        <meta http-equiv="content-type" content="text/html; charset=UTF-8">
        <meta name="viewport" content="width=1000">
        <meta name="robots" content="noindex, nofollow">
        <meta http-equiv="cache-control" content="max-age=0">
        <meta http-equiv="cache-control" content="no-cache">
        <meta http-equiv="expires" content="0">
        <meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT">
        <meta http-equiv="pragma" content="no-cache">
    
        <script type="text/javascript" async="" src="https://www.gstatic.com/recaptcha/releases/PRkVene3wKrZUWATSylf69ja/recaptcha__en.js"></script><script>
            function showBlockPage() {
            document.getElementsByClassName("container")[0].style.display = "";
            }
            setTimeout(showBlockPage, 10000);
        </script>
        <script type="text/javascript" src="/g2-meta-data" async="" defer=""></script>
        <script>if (window.sessionStorage) { sessionStorage.setItem('distil_referrer', document.referrer); }</script>
    
                <script src="https://www.google.com/recaptcha/api.js" async="" defer=""></script>
                <script>
                function solvedCaptcha(payload) {
                    const timeoutMs = 10000;
                    protectionSubmitCaptcha("recaptcha", payload, timeoutMs).then(function() {
                    window.location.reload(true);
                    });
                }
                </script>
    
        </head>
        <body class="block-page">
    
    
    
        <div class="container" style="">
            <script>document.getElementsByClassName("container")[0].style.display = "none";</script>
            <noscript>This page requires JavaScript!</noscript>
    
            <div class="row">
            <div class="sidebar col-lg-4 col-sm-5">
                <img src="//cdn.distilnetworks.com/images/anomaly-detected.png" alt="0">
            </div>
            <div class="content col-lg-8 col-sm-7">
                <h1>Pardon Our Interruption...</h1>
                <p>
                As you were browsing something about your browser made us think you were a bot. There are a few reasons this might happen:
                </p>
                <ul>
                <li>You're a power user moving through this website with super-human speed.</li>
                <li>You've disabled JavaScript and/or cookies in your web browser.</li>
                <li>A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this <a title="Third party browser plugins that block javascript" href="http://ds.tl/help-third-party-plugins" target="_blank">support article</a>.</li>
                </ul>
    
    
                <script>showBlockPage()</script>
    
                <p>After completing the CAPTCHA below, you will immediately regain access to the site again.</p>
    
            <div class="g-recaptcha" data-sitekey="6LcfNLkUAAAAALPSa4GI_zHIPcYVGlxNOdvMsUsh" data-callback="solvedCaptcha"><div style="width: 304px; height: 78px;"><div><iframe src="https://www.google.com/recaptcha/api2/anchor?ar=1&amp;k=6LcfNLkUAAAAALPSa4GI_zHIPcYVGlxNOdvMsUsh&amp;co=aHR0cHM6Ly93d3cuZzIuY29tOjQ0Mw..&amp;hl=en&amp;v=PRkVene3wKrZUWATSylf69ja&amp;size=normal&amp;cb=m8amuk5fpfe" width="304" height="78" role="presentation" name="a-x8exk2gk39a9" frameborder="0" scrolling="no" sandbox="allow-forms allow-popups allow-same-origin allow-scripts allow-top-navigation allow-modals allow-popups-to-escape-sandbox"></iframe></div><textarea id="g-recaptcha-response" name="g-recaptcha-response" class="g-recaptcha-response" style="width: 250px; height: 40px; border: 1px solid rgb(193, 193, 193); margin: 10px 25px; padding: 0px; resize: none; display: none;"></textarea></div></div>
            </div>
            </div>
        </div>
    
    
    
    <div id="d__fFH" style="position: absolute !important; top: -5000px !important; left: -5000px !important;"><object id="d_dlg" classid="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></object><span id="d__fF" style="font-family: ZWAdobeF, serif !important; font-size: 72px !important; visibility: hidden;">mmmmmmmmlli</span></div><div style="background-color: rgb(255, 255, 255); border: 1px solid rgb(204, 204, 204); box-shadow: rgba(0, 0, 0, 0.2) 2px 2px 3px; position: absolute; transition: visibility 0s linear 0.3s, opacity 0.3s linear 0s; opacity: 0; visibility: hidden; z-index: 2000000000; left: 0px; top: -10000px;"><div style="width: 100%; height: 100%; position: fixed; top: 0px; left: 0px; z-index: 2000000000; background-color: rgb(255, 255, 255); opacity: 0.05;"></div><div class="g-recaptcha-bubble-arrow" style="border: 11px solid transparent; width: 0px; height: 0px; position: absolute; pointer-events: none; margin-top: -11px; z-index: 2000000000;"></div><div class="g-recaptcha-bubble-arrow" style="border: 10px solid transparent; width: 0px; height: 0px; position: absolute; pointer-events: none; margin-top: -10px; z-index: 2000000000;"></div><div style="z-index: 2000000000; position: relative;"><iframe title="recaptcha challenge" src="https://www.google.com/recaptcha/api2/bframe?hl=en&amp;v=PRkVene3wKrZUWATSylf69ja&amp;k=6LcfNLkUAAAAALPSa4GI_zHIPcYVGlxNOdvMsUsh&amp;cb=yl5twmy9lj55" name="c-x8exk2gk39a9" frameborder="0" scrolling="no" sandbox="allow-forms allow-popups allow-same-origin allow-scripts allow-top-navigation allow-modals allow-popups-to-escape-sandbox" style="width: 100%; height: 100%;"></iframe></div></div></body></html>
    

Analysis

On inspecting the page you will find the <body> tag contains:

<script>window.distilReferrerValue = function() {
  var value;

  try {
    if (window.sessionStorage) {
      value = sessionStorage.getItem('distil_referrer');
      sessionStorage.removeItem('distil_referrer');
    }
  } catch(e) {}

  window.distilReferrerValue = function() {
    return value;
  };
  return value;
};</script>

This, together with the cdn.distilnetworks.com references in the console output above, is a clear indication that the website https://www.g2.com/products/dropbox/reviews is protected by the bot management service provider Distil Networks, and that navigation by ChromeDriver gets detected and subsequently blocked.
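
As a rough illustration of that check, here is a minimal sketch that tests a page source for the markers visible in the console output above. The names `DISTIL_MARKERS`, `find_distil_markers` and `looks_blocked` are hypothetical, and the marker list is an assumption based only on this one block page; other Distil-protected pages may look different.

```python
# Strings taken from the block page shown in the console output above.
# Assumption: these are enough to recognise this particular block page.
DISTIL_MARKERS = (
    "Pardon Our Interruption",
    "cdn.distilnetworks.com",
)


def find_distil_markers(page_source):
    """Return the known block-page markers that occur in page_source."""
    return [marker for marker in DISTIL_MARKERS if marker in page_source]


def looks_blocked(page_source):
    """True if at least one block-page marker is present."""
    return bool(find_distil_markers(page_source))


sample = (
    "<title>Pardon Our Interruption</title>"
    '<link rel="stylesheet" href="//cdn.distilnetworks.com/css/distil.css">'
)
print(find_distil_markers(sample))
# prints ['Pardon Our Interruption', 'cdn.distilnetworks.com']
```

In the script above, `looks_blocked(driver.page_source)` could be called right after `driver.get(...)` to tell a blocked response apart from the real reviews page.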


Distil

As per the article "There Really Is Something About Distil.it...":

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with **Selenium** was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".


Reference

You can find a relevant discussion in "Chrome browser initiated through ChromeDriver gets detected".

undetected Selenium