Even though I'm rotating proxies and user agents with Selenium headless Chrome (the IPs come from https://free-proxy-list.net/ and TOR, and I've tested the setup against https://httpbin.org/, which shows the expected proxy IP and user agent, so I know that part is working), I'm still getting blocked on the very first try with a fresh IP and user agent at Glassdoor's main page "https://www.glassdoor.com/index.htm".

As context:

  • Being developed in a Docker container which is run locally
  • Using Headless Chrome with Selenium Python
  • Using proxies freshly extracted from https://free-proxy-list.net/, and also Selenium with a rotating TOR proxy (both give the same results)
  • Using random user agents from https://developers.whatismybrowser.com/useragents/explore/software_name/chrome/ that are consistent with the Docker container's OS and browser specs (X11 and Chrome/6 or Chrome/7, so there are no JS rendering issues) – a simplified sketch of the rotation helper follows this list
  • Scraping Glassdoor job postings. Other job websites work fine, so it's Glassdoor-specific.
  • It works fine if I use a free local VPN provider like ProtonVPN, but that isn't scalable: the whole idea is to not spend money on this side project and to keep the collection automated (it's not a commercial product whatsoever, I just want enough data to practice some NLP/machine learning)
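
Since I didn't paste the full helper code, here is a simplified sketch of the user-agent rotation (the pool below is illustrative; the real one is filled from the whatismybrowser list, and random_user_agent() is the helper used in the Chrome setup further down):

import random

# Illustrative pool; the real list comes from whatismybrowser and is kept
# consistent with the container (X11, desktop Chrome).
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
]

def random_user_agent():
    # Pick one user agent at random for the next browser session
    return random.choice(USER_AGENTS)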

This is the Chrome setup:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--disable-translate")
# `ip` is a "host:port" string taken from the current proxy in the rotation
chrome_options.add_argument(f"--proxy-server={ip}")
# no brackets around the user agent, otherwise they end up inside the UA string sent to the site
chrome_options.add_argument(f"user-agent={random_user_agent()}")

My theory is that Glassdoor is probing the browser somehow, and either the proxy or some browser setting is giving away that it's automated. Any ideas on what is happening?

EDIT: I've checked the possibility that Selenium itself is being detected, but the same Selenium setup scrapes without any problem when a VPN is active, while free proxies and TOR get blocked. So the issue must be in using a proxy vs. a VPN; maybe someone can help me understand how that difference is being detected.

  • I'm with @JeffC on this one. I'm uncomfortable with the question because it seems you're asking for advice on how to circumvent a security feature. – Marcel Wilson Jul 16 '19 at 15:38
  • There are many ways to fingerprint a browser and detect Selenium; you've only scratched the surface with IP and user agent. However, you should just abide by their terms of use and stop scraping. – Corey Goldberg Jul 16 '19 at 16:51

3 Answers


I don't think it has anything to do with your IP address or user agent. You are probably getting blocked because the site is trying to detect and block scraping in general. See Can a website detect when you are using selenium with chromedriver?
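
For example, one of the simplest signals covered there is the navigator.webdriver flag, which chromedriver-driven Chrome exposes to every page. A quick sketch to see for yourself what the site can see:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://httpbin.org/")
# Prints True: any JavaScript on the page can read this flag and mark the session as automated.
print(driver.execute_script("return navigator.webdriver"))
driver.quit()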

jmq
  • I've read the question you pointed me to, and while it's very helpful for understanding the limitations of Selenium, it still doesn't explain why I'm detected only when using a free proxy. I can assume they block those IPs by extracting the same list I did, but TOR is harder to explain that way, because there's no predetermined list of IPs AFAIK. Also, why am I not blocked while using a VPN? The answer should be more about the difference between using a proxy and a VPN; I assume they leave different signatures and some websites simply don't allow proxies, I guess? – João Miguel Santos Jul 17 '19 at 09:58

I was able to access this page with the simple Python script below.

Maybe the website has fingerprinted the browser automated by Selenium. You could try https://github.com/GoogleChrome/puppeteer or my script. Also, free proxies are often of poor quality; you might consider your own server or a paid proxy.

To pick the best user agent, you can use this library: https://github.com/Lobstrio/shadow-useragent
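
For example (a sketch based on that project's README; I'm assuming it exposes a ShadowUserAgent class with a most_common property as documented there):

from shadow_useragent import ShadowUserAgent

ua = ShadowUserAgent()
# most_common is supposed to return the user agent seen most often in real traffic
print(ua.most_common)

And the script I used to access the page: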

import requests

headers = {
    'authority': 'www.glassdoor.fr',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7',
}

params = (
    ('countryRedirect', 'true'),
)

response = requests.get('https://www.glassdoor.fr/index.htm', headers=headers, params=params)
print(response.content)
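
If you want to reproduce the blocking outside the browser, you can route the same request through one of your proxies and compare status codes (a sketch reusing the headers and params above; the proxy address is a placeholder):

# Placeholder proxy from the rotation; replace with a real host:port from your list.
proxies = {
    "http": "http://203.0.113.10:3128",
    "https": "http://203.0.113.10:3128",
}
response = requests.get('https://www.glassdoor.fr/index.htm',
                        headers=headers, params=params, proxies=proxies)
print(response.status_code)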

SimonR

Glassdoor has an API that you can access as a partner (you'll need to contact them for access). It should provide you with everything you need without scraping the site.

https://www.glassdoor.com/developer/index.htm
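
A rough sketch of what a call could look like once you have partner credentials (the endpoint and parameter names below are illustrative placeholders; check the docs above for the real ones):

import requests

# All values below are placeholders: Glassdoor supplies the partner id/key,
# and the documented actions/parameters may differ from these illustrative names.
params = {
    "v": "1",
    "format": "json",
    "t.p": "YOUR_PARTNER_ID",
    "t.k": "YOUR_PARTNER_KEY",
    "action": "jobs-stats",
    "q": "data scientist",
}
response = requests.get("https://api.glassdoor.com/api/api.htm", params=params)
print(response.json())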

Marcel Wilson