
I'm using a WebDriver-based crawler to collect information from a Chinese news website (Toutiao). Since Feb. 16, 2020, the site has not returned any data to a Chrome instance controlled by WebDriver, while a manually started Chrome works fine (as the figure below shows).

The left side is chrome started manually, the right side is chrome controlled by webdriver.

[Screenshot comparing the two Chrome windows]

Both Chrome instances run on the same IP, and I have set the same User-Agent for both. Moreover, I use the following code (from DebanjanB) to remove "navigator.webdriver" (as the figure above shows, this works):

from selenium import webdriver

options = webdriver.ChromeOptions()
# Drop the automation switch and the automation extension that ChromeDriver adds by default
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'./chromedriver')
# Make navigator.webdriver read as undefined in every new document
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
  "source": """
    Object.defineProperty(navigator, 'webdriver', {
      get: () => undefined
    })
  """
})

In summary: the same IP, the same User-Agent, and "navigator.webdriver" is removed. Why does the website still detect that my Chrome is controlled by WebDriver?
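One way to narrow this down is to collect the same set of properties in both browsers and diff them. The sketch below is only an illustration: the probe list is an assumption (it is not known which properties the site actually checks), and `build_probe_script` and `diff_fingerprints` are hypothetical helper names.

```python
import json

# Hypothetical probe list: JavaScript expressions whose values might differ
# between a manually started and a WebDriver-controlled Chrome.
PROBES = [
    "navigator.webdriver",
    "navigator.plugins.length",
    "navigator.languages",
    "window.outerWidth",
    "window.outerHeight",
]


def build_probe_script(probes):
    """Return a JS expression that evaluates to a JSON string of probe values."""
    entries = ", ".join(f"{json.dumps(p)}: {p}" for p in probes)
    return f"JSON.stringify({{{entries}}})"


def diff_fingerprints(manual, driven):
    """Return {probe: (manual_value, driven_value)} for every differing probe."""
    keys = sorted(set(manual) | set(driven))
    return {
        k: (manual.get(k), driven.get(k))
        for k in keys
        if manual.get(k) != driven.get(k)
    }
```

The generated expression can be pasted into the DevTools console of the manually started Chrome, or evaluated in the driven one with `driver.execute_script("return " + build_probe_script(PROBES))`. Parsing both results with `json.loads` and passing them to `diff_fingerprints` lists only the probes that differ. Note that `JSON.stringify` omits keys whose value is `undefined`, so a missing key means "undefined" here.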

UPDATE

The website fetches its content by requesting a URL. If I copy that URL (with its encrypted parameters) from the manually started Chrome into the WebDriver-controlled Chrome, the server sends the correct information to the WebDriver-controlled browser.

So the website definitely detects WebDriver when it generates the URL and its encrypted parameters.
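To see which parameters actually differ, the request URL captured in each browser's Network tab can be compared field by field. A minimal sketch follows; `diff_query_params` is a hypothetical helper, and the real parameter names and values are not shown in this post, so any URLs fed to it here are placeholders.

```python
from urllib.parse import urlsplit, parse_qs


def diff_query_params(url_manual, url_driven):
    """Compare the query strings of two URLs.

    Returns {param: (manual_values, driven_values)} for every parameter whose
    values differ or which is present in only one of the URLs.
    """
    q_manual = parse_qs(urlsplit(url_manual).query, keep_blank_values=True)
    q_driven = parse_qs(urlsplit(url_driven).query, keep_blank_values=True)
    return {
        name: (q_manual.get(name), q_driven.get(name))
        for name in sorted(set(q_manual) | set(q_driven))
        if q_manual.get(name) != q_driven.get(name)
    }
```

For example, with the made-up URLs `https://example.com/feed?category=news&sig=AAA` and `https://example.com/feed?category=news&sig=`, the result is `{"sig": (["AAA"], [""])}`, which points at the one field worth investigating.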

UPDATE 2

Please note that the discussion "Can a website detect when you are using selenium with chromedriver?" does not solve this problem.

Caspar
  • I'm using Chrome 80 and the corresponding WebDriver. I have read the thread and tried all the approaches except recompiling WebDriver to change the names of the parameters ('cdc'?). I think that approach is based on an old version of Chrome. – Caspar Feb 18 '20 at 07:38
  • In addition to "navigator.webdriver", I found that the two Chrome instances have the same value of "navigator.plugins" – Caspar Feb 18 '20 at 07:39
  • The problem has been reproduced on Ubuntu, Windows 10 and Mac with different versions of Chrome and chromedriver – Caspar Feb 18 '20 at 08:43
  • Unfortunately, a WebDriver-controlled browser and a normal browser do differ in a few ways which are probably detectable. Selenium/WebDriver are not interested in making a 100% undetectable robot, as that's not the reason why they exist. One approach, which I have used in the past, is to script the developer tools (debugger) of the browser instead, through remote debugging. This is undetectable, but it's not as nice as Selenium. – nneonneo Feb 18 '20 at 08:52
  • Thanks. I just want to know how the site detects WebDriver. Could other tools such as Puppeteer or cdp4j be detected with the same approach? – Caspar Feb 18 '20 at 09:06
  • If you're referring to the `api/pc/feed` endpoint, yes, I noticed that too. The "signature" is generated by a routine from `acrawler.js`. Unfortunately, this is **heavily** obfuscated: there are two layers of normal JS obfuscation (a packer, then some non-standard Base64), and then an entire virtual machine (virtual CPU, with its own assembly language) implemented in JS with (what appear to be) special opcodes to access the DOM. – nneonneo Feb 18 '20 at 11:02
  • Thanks, I think the [link](https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver) does not solve my problem. I will try other approaches. – Caspar Feb 18 '20 at 11:34
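
The remote-debugging route nneonneo mentions can be sketched roughly as follows: start an ordinary Chrome with `--remote-debugging-port`, then talk to it over its DevTools endpoint instead of launching it through chromedriver. This is an illustration under assumptions, not a recipe tested against this site: the binary name `google-chrome`, port 9222, the profile directory, and the helper names are all placeholders.

```python
import json
import subprocess
from urllib.request import urlopen


def chrome_command(binary="google-chrome", port=9222, profile_dir="/tmp/chrome-profile"):
    """Build the command line for a normally started Chrome with remote debugging on.

    The binary name, port and profile directory are assumptions; adjust per platform.
    """
    return [
        binary,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={profile_dir}",
    ]


def launch_chrome(**kwargs):
    """Start Chrome in the background and return the Popen handle."""
    return subprocess.Popen(chrome_command(**kwargs))


def list_targets(port=9222):
    """Return the open tabs/targets reported by Chrome's DevTools HTTP endpoint."""
    with urlopen(f"http://127.0.0.1:{port}/json") as resp:
        return json.load(resp)
```

After `launch_chrome()`, each entry returned by `list_targets()` carries a `webSocketDebuggerUrl` over which DevTools Protocol commands can be sent. Whether this avoids the site's detection would still have to be verified.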

1 Answer


I have had luck with the following Puppeteer script (it is not headless, but hopefully it is of use):

'use strict'
/* Get the puppeteer API */
const puppeteer = require('puppeteer')
;(async () => {
    console.log('start')
    const browser = await puppeteer.launch({
        headless: false,
        defaultViewport: null,
        //product: "firefox",
        //userDataDir: '/Users/bartic/Library/Application Support/Chromium',
        ignoreHTTPSErrors: true,
        args: [
            '--disable-infobars',
            `--window-size=1900,1000`,
            `--window-position=100,0`,
        ],
        pipe: false,
        devtools: true
    })
    console.log('browser created')
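    const page = await browser.newPage()
    // Pass the webdriver test: remove navigator.webdriver before any page script runs
    await page.evaluateOnNewDocument(() => {
      delete navigator.__proto__.webdriver;
    });
    await page.goto('https://www.toutiao.com/')
    console.log('on page')
    // Waits for a further navigation; rejects with a TimeoutError after the default 30 s timeout if none happens
    await page.waitForNavigation()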
})()