Even though I'm rotating proxies and user agents with Selenium headless Chrome (the proxies come from https://free-proxy-list.net/ and from Tor, and I've tested the setup against https://httpbin.org/, which shows the expected proxy IP and user agent, so the rotation itself works), I'm still getting blocked on the very first request with a fresh IP and user agent at Glassdoor's main page, https://www.glassdoor.com/index.htm.
As context:
- Developed in a Docker container that runs locally
- Using headless Chrome with Selenium for Python
- Using proxies freshly extracted from https://free-proxy-list.net/, and also Selenium with a rotating Tor proxy (both give the same result); see the rotation sketch after this list
- Using random user agents from https://developers.whatismybrowser.com/useragents/explore/software_name/chrome/ that are consistent with the Docker container's OS and browser specs (X11 and Chrome/6 or Chrome/7, so there are no JS display issues)
- Scraping Glassdoor job postings; other job sites work fine, so the problem is Glassdoor-specific
- It works fine if I use a free local VPN provider like ProtonVPN, but that solution doesn't scale: the whole idea of this side project is to automate the collection without spending money (it's not a commercial product whatsoever, I just want enough data to practice some NLP/machine learning)
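For reference, the rotation itself is nothing fancy. A minimal sketch (PROXIES and USER_AGENTS are placeholder pools here; in reality they're filled from the two sites above):

import random

# Placeholder pools: in practice PROXIES is scraped from free-proxy-list.net
# and USER_AGENTS comes from developers.whatismybrowser.com
PROXIES = ["203.0.113.5:8080", "198.51.100.23:3128"]
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
]

def random_proxy():
    return random.choice(PROXIES)

def random_user_agent():
    return random.choice(USER_AGENTS)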
This is the Chrome setup:
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--no-sandbox")  # needed when Chrome runs as root inside Docker
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--disable-translate")
chrome_options.add_argument(f"--proxy-server={ip}")  # ip is a "host:port" string from the pool
# no brackets around the UA string, or they end up inside the header value
chrome_options.add_argument(f"user-agent={random_user_agent()}")
My theory is that Glassdoor is probing the browser somehow, and something is giving away either that I'm behind a proxy or that the browser is automated. Any ideas what is happening?
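On the "automated browser" half of that theory, two things I know a page's JS can read are the navigator.webdriver flag, which ChromeDriver sets to true, and the default user agent, which reports "HeadlessChrome" in headless builds. A quick check, assuming the driver from above:

# True under ChromeDriver; any page's JavaScript can read this
print(driver.execute_script("return navigator.webdriver"))
# Without the user-agent override above, headless builds report "HeadlessChrome" here
print(driver.execute_script("return navigator.userAgent"))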
EDIT: I've checked the possibility that Selenium itself is being detected, but the same Selenium setup scrapes without any problem when a VPN is active and only gets blocked on the free proxies/Tor. So the issue must lie in using a proxy versus a VPN; maybe someone can help me understand how that is happening.
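In case it helps narrow things down, a check that separates "the proxy IP is blacklisted" from "the browser is fingerprinted" is hitting the same page with a bare HTTP client through the same proxy (a sketch; requests is an extra dependency, not part of the setup above):

import requests

resp = requests.get(
    "https://www.glassdoor.com/index.htm",
    proxies={"http": f"http://{ip}", "https": f"http://{ip}"},
    headers={"User-Agent": random_user_agent()},
    timeout=15,
)
# 403/429 here as well -> the IP itself is burned;
# 200 here but blocked in Selenium -> something about the browser is being detected
print(resp.status_code)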