
I will try to describe the problem with sample code. Here is code in C# that opens an instance of a Chrome browser and navigates to nseindia.com:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

namespace nseindia_selenium
{
    class Program
    {
        static void Main( string [ ] args )
        {
            ChromeOptions options = new ChromeOptions ();
            options.BinaryLocation = "C:\\Users\\Subhasis\\AppData\\Local\\Chromium\\Application\\chrome.exe";
            //options.AddAdditionalCapability ( "w3c" , true );
            options.AddArgument ( "no-sandbox" );
            options.AddArgument ( "start-maximized" );
            options.AddArgument ( "disable-gpu");
            options.AddArgument ( "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36" );
            options.AddExcludedArgument ( "enable-automation" );
            options.AddAdditionalCapability ( "useAutomationExtension" , false );
            //options.AddAdditionalCapability ( "chrome.page.customHeaders.referrer" , "https://www.nseindia.com" );
            ChromeDriver chrome1 = new ChromeDriver (options);
            chrome1.Navigate ().GoToUrl ( "https://www.nseindia.com/" );
        }
    }
}

Things seem to be working up to this point (not really: all the data fields appear empty). But at this stage, if I take manual control of the browser window and try to browse to any other part of nseindia.com, I get an error.

At this point, even if I try to go back to the homepage of the site, it does not let me.

Past answers to this same question recommended manually setting the referrer. But when I do,

options.AddAdditionalCapability ( "chrome.page.customHeaders.referrer" , "https://www.nseindia.com" );

it gives me an "Invalid argument" error. Also, if the referrer were at fault, that would not explain why manually clicking a link does not work either.
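(For context, a sketch of the only other route I can see for attaching a Referer header — going through the DevTools protocol instead of a capability. This assumes the Selenium 3.141 .NET bindings, where ChromeDriver exposes ExecuteChromeCommand (Selenium 4 renames it to ExecuteCdpCommand), and it reuses the chrome1 driver from the listing above; I have not confirmed it gets past the block:)

// Hypothetical sketch: set a Referer header on every request via the
// Chrome DevTools Protocol rather than the old customHeaders capability.
// Assumes Selenium 3.141 .NET bindings; Dictionary comes from
// System.Collections.Generic, already imported in the listing above.
chrome1.ExecuteChromeCommand ( "Network.enable" , new Dictionary<string, object> () );
chrome1.ExecuteChromeCommand ( "Network.setExtraHTTPHeaders" ,
    new Dictionary<string, object>
    {
        { "headers", new Dictionary<string, object> { { "Referer", "https://www.nseindia.com" } } }
    } );
chrome1.Navigate ().GoToUrl ( "https://www.nseindia.com/" );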

The past answers had also recommended using

options.AddAdditionalCapability ( "useAutomationExtension" , false );

But that does not work anymore because I get this message:

[1601022727.512][WARNING]: Deprecated chrome option is ignored: useAutomationExtension
[1601022727.512][WARNING]: Deprecated chrome option is ignored: useAutomationExtension
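(For reference, the replacement I can find for that deprecated capability is keeping AddExcludedArgument ( "enable-automation" ), which is already in the listing, plus masking navigator.webdriver through the DevTools protocol before the page's own scripts run. A sketch, assuming the same ExecuteChromeCommand API as above; the JavaScript snippet and its placement are my assumption, not something from the past answers:)

// Hypothetical sketch: hide navigator.webdriver before any page script runs,
// using the DevTools command Page.addScriptToEvaluateOnNewDocument.
// Assumes Selenium 3.141 .NET bindings (ChromeDriver.ExecuteChromeCommand).
chrome1.ExecuteChromeCommand ( "Page.addScriptToEvaluateOnNewDocument" ,
    new Dictionary<string, object>
    {
        { "source", "Object.defineProperty(navigator, 'webdriver', { get: () => undefined });" }
    } );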

In the past this used to happen because of the wrong w3c mode, but I cannot switch the w3c mode anymore either. When I put

options.AddAdditionalCapability ( "w3c" , true );

I get an error saying:

System.ArgumentException: 'There is already an option for the w3c capability. Please use the  instead.
Parameter name: capabilityName'
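(If it matters: the .NET bindings seem to reserve the w3c capability for a dedicated property on ChromeOptions. A sketch of setting it that way, assuming the property is UseSpecCompliantProtocol as in the Selenium 3.141 .NET bindings; I have not verified that it changes the outcome on this site:)

// Hypothetical sketch: the w3c capability is reserved by the .NET bindings,
// so it is set through a ChromeOptions property instead of
// AddAdditionalCapability. Assumes Selenium 3.141's ChromeOptions.
ChromeOptions options = new ChromeOptions ();
options.UseSpecCompliantProtocol = true;   // W3C mode
// options.UseSpecCompliantProtocol = false; // legacy (OSS) protocol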

So what do I do?

Spero
  • Well, the answer depends on what's causing you to not have access. The things you've tried are potentially some of the things that cause this, but there are any number of other things, not least of which is that this web page may have active countermeasures to prevent it from getting crawled – Liam Sep 25 '20 at 10:47
  • @Liam It seems that the website uses Akamai bot detection. Any way to bypass it? – Spero Sep 25 '20 at 12:12
  • Quite possibly. People don't want to be crawled. Crawlers often steal computing power and information from web pages. I've worked on pages where we've tried to prevent people from crawling our site as it costs us money and the people crawling are trying to steal our business. If you have a legitimate reason to crawl this site I'd suggest you contact whoever owns it and ask for permission. – Liam Sep 25 '20 at 12:15
  • @Liam Unfortunately I am in no position to take or even suggest taking such high level decisions. My employer wants me to write code that can crawl this website and if I fail, that will be bad for my job. – Spero Sep 25 '20 at 12:41
  • Let's put it this way: even if there were a way to bypass the bot detection, I'm not going to tell you what it is – Liam Sep 25 '20 at 13:00
  • [Well good luck](https://stackoverflow.com/jobs) – Liam Sep 25 '20 at 13:42
  • @Liam if you don't want the public to know things, then maybe you shouldn't run a public website. Keep it internal. Scraping is a part of free speech. If it's too hard on your servers, maybe you should be a proper dev and write in 429 rate limiting and add the proper headers, which include what that rate is and how long to wait. Scraping is no different than 10 unique visitors copying/pasting what they see using a real browser. Don't want automation? Then I send 10 employees to open up a real browser and load your website. Then what? Unless it's a DDoS, you shouldn't do anything about it. – DataMinion Oct 16 '22 at 05:37

0 Answers