
So I've been scraping Glassdoor for company reviews for a while now. In the past, I had scraped the site pretty easily using one line of code:

page = http.get(url + id + ".htm", timeout=1000000, headers=HEADERS)

In fact, I didn't even need the headers argument! This code worked wonders until I took about a six-month break from the project. When I returned, instead of picking up right where I left off as I expected, every request for the page returned <Response [403]> along with the HTML for a "security page". As a result, I have not been able to get any usable data from the website.

As this is quite a common occurrence, I scoured many Stack Overflow questions and implemented their suggestions. All of the following changes have been attempted, none with any success:

  1. Adding a 'user-agent' header containing only 'Mozilla/5.0'

  2. Adding a more complicated 'user-agent' header such as:

    'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'
    
  3. Adding a completely filled out header, based on this stack overflow question: Web Scraping getting error (HTTP Error 403: Forbidden) using urllib

  4. Adding only the 'user-agent' and 'Accept-Language' headers, setting the latter to

    'en-US;q=0.7,en;q=0.3'
    
  5. Replacing the httpx library (the http in my snippet above) with the requests library

  6. Instead of calling requests.get() directly, creating a Session and calling session.get(), since a session keeps track of cookies, which lets some websites' blockers be bypassed

  7. And the last, desperate thing I tried was using proxies, based on this question: Web scraping results in 403 Forbidden Error, together with the free random proxies from https://free-proxy-list.net/. I cycled through about 3 different addresses with varying 'specs'; none of them worked. (A condensed sketch of these attempts follows this list.)
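
To make the failed attempts concrete, here is a condensed sketch combining them (the proxy address is one of the free ones I cycled through, so treat it as a placeholder):

import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

session = requests.Session()                            # attempt 6: sessions keep cookies between requests
session.headers.update(HEADERS)                         # attempts 1-4: various header combinations
session.proxies = {'http': 'http://209.141.62.12:80'}   # attempt 7: free proxy (placeholder)

page = session.get("https://www.glassdoor.com/Reviews/-Reviews-E432.htm", timeout=30)
print(page.status_code)                                 # every variation still printed 403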

At this point, I have pretty much no leads on what sets off the red flags (perhaps my IP was flagged?). I have attached my code below in case that is helpful. Again, this behavior is new; just a few months ago everything was working smoothly...

url = "https://www.glassdoor.com/Reviews/-Reviews-E"

HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Cookie': 'Here is where I copied the cookies from my browser, I looked through it and it contained some info that might be able to personally identify me so I removed it from the post',
    'Sec-Ch-Ua': '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

if __name__ == "__main__":
    id = input("What Glassdoor company would you like to scrape [enter an id]: ")
    # getting "403 Forbidden"

    session = requests.Session()
    session.headers = HEADERS
    session.proxies = {'http': '209.141.62.12'}
    # this code returns 403 forbidden
    # page = http.get(url + id + ".htm", timeout=1000000, headers=HEADERS)
    # headers/proxies below are redundant with the session attributes above,
    # but I left them in while debugging
    page = session.get(url + id + ".htm", timeout=100000, headers=HEADERS, proxies={'http': '209.141.62.12'})
    try:
        data = re.findall(r'apolloState":\s*({.+})};', page.text)[0]
    except IndexError:
        # create_log_directory is my own helper, defined elsewhere; it works fine
        dir = create_log_directory("failed", id)
        logging.basicConfig(filename=dir + "info.log", encoding='utf-8', filemode='w', level=logging.DEBUG)
        logging.critical("failed")
        logging.debug(url + id + ".htm")
        logging.debug(page.text)
        sys.exit()

For context, you can assume all the logging functions work fine and that the line of code in the try block will throw an IndexError only if I am returned an error or incorrect page. Additionally, I removed the value of the 'Cookie' header, as it contained info that might personally identify me and I want to avoid doxxing myself. However, it is worth noting that the cookie (and all other headers) came directly from my Chrome browser (I visited the site and used Chrome's dev tools to inspect my request headers). For testing purposes, you could replace the cookie with your own.

I really hope someone has an idea for how to fix this, and if it works fine on anyone's local device, perhaps something about the request coming from me gives it away. I do hope my post is not too wordy and repetitive, but I really wanted to show the extent of what I tried and don't want this post to be closed (as either a duplicate or for 'lack of effort').

Finally, I want to mention that while the code above is not asynchronous, the solution MUST be usable with some form of async. For example, I previously used the aiohttp package with asyncio to send multiple GET requests simultaneously (a rough sketch of that pattern is below). While I am completely open to using new packages, I would prefer that any such package have some async capabilities :) Thanks again to anyone who took the time to read this!
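
For reference, the pattern I used looked roughly like this sketch (reconstructed from memory, so the helper names and the ids list are illustrative; HEADERS is the dict shown above):

import asyncio
import aiohttp

URL = "https://www.glassdoor.com/Reviews/-Reviews-E"

async def fetch(session, company_id):
    # request a single company's review page
    async with session.get(URL + company_id + ".htm") as response:
        return await response.text()

async def main(ids):
    # one shared session; all requests fired concurrently
    async with aiohttp.ClientSession(headers=HEADERS) as session:
        return await asyncio.gather(*(fetch(session, i) for i in ids))

pages = asyncio.run(main(["432"]))  # e.g. McDonald's; add more ids to fan out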

Contone
  • I am currently experimenting with the answer to this question: https://stackoverflow.com/questions/73012176/python-web-scrapping-error-403-even-with-header-user-agent?rq=2. I will update this post if it ends up working, otherwise I'll keep trying other things. – Contone Jul 06 '23 at 16:58
  • I've attempted using cloudscraper, which works for a little while, but then the website seems to catch on and I am redirected to a Cloudflare page and cut off. So for now, this question still stands. – Contone Jul 06 '23 at 18:27
  • Can you provide some example of company id or full URL to review? – ands Jul 06 '23 at 18:41
  • Of course, McDonald's is 432. You will also need a profile id, which is different for each company. For McDonald's the profile id is 436. The URL I use to access the website is: https://www.glassdoor.com/Reviews/-Reviews-E432.htm which will then redirect to https://www.glassdoor.com/Reviews/McDonald-s-Reviews-E432.htm. I've also narrowed the issue down to something to do with Cloudflare; however, I have not had any luck bypassing it so far. – Contone Jul 06 '23 at 22:00
  • Yeah, Cloudflare prevents you from scraping pages. You could try to imitate a real user by using [Selenium](https://selenium-python.readthedocs.io/) or [Playwright](https://playwright.dev/python/), but I assume that even that could only work on a small scale because there is probably a limit to how many sites you can request before Cloudflare blocks you with a CAPTCHA or something like that. – ands Jul 07 '23 at 01:45
  • Do you know if there is a way to imitate a bunch of different real users through proxies around the world? Would this allow me to scale up my requests? While I don't need to make like 100 requests a second, I might need 20-50 requests a minute (optimally). In the past I did this through async, but if I have to create a bunch of real users using proxies, is it still possible? – Contone Jul 07 '23 at 01:58
  • I don't have much experience with this, I only scraped webpages that didn't block you, but if you used proxies it should work. The problem is that it probably won't work with free proxies. I tried a dozen free proxy websites, and most of them don't work, a few that do show an [error message](https://i.imgur.com/JcacCjY.png). You could try using the [Glassdoor API](https://www.glassdoor.com/developer/index.htm), but again, you would probably have to pay, and you need to be approved by Glassdoor. – ands Jul 08 '23 at 15:56
  • I can try again with proxies, though I'm a little skeptical that I could get it working on a larger scale. Additionally, I've actually contacted Glassdoor about their API--this is what they said, "Our apologies for any inconvenience this caused you. Our API sign up is currently closed until further notice." This pretty much forces me to scrape them as they haven't given me any other way to access their data. – Contone Jul 09 '23 at 00:52

1 Answer


I found a solution that can bypass Cloudflare's protections: the Python module cloudscraper (a fork of cloudflare-scrape). It works on a small scale, but its README says that if you get a reCAPTCHA challenge, it won't be able to scrape the page. It is pretty simple to use:

import re
import sys

import cloudscraper

url = "https://www.glassdoor.com/Reviews/-Reviews-E"

HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Cookie': 'Here is where I copied the cookies from my browser, I looked through it and it contained some info that might be able to personally identify me so I removed it from the post',
    'Sec-Ch-Ua': '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

if __name__ == "__main__":
    id = input("What Glassdoor company would you like to scrape [enter an id]: ")
    #getting "403 Forbidden "

    scraper = cloudscraper.CloudScraper()
    #scraper.headers = HEADERS
    #scraper.proxies = {'http':'209.141.62.12'}
    page = scraper.get(url + id + '.htm', timeout=100000)
    try:
        data = re.findall(r'apolloState":\s*({.+})};', page.text)[0]
    except IndexError as e:
        print(e)
        sys.exit()

You can set the headers and proxies attributes just like on a requests.Session() object (CloudScraper subclasses it), but it works without them.
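
For example, a minimal sketch of setting them (the proxy address is just a placeholder):

scraper = cloudscraper.create_scraper()  # factory alternative to CloudScraper()
scraper.headers.update(HEADERS)
scraper.proxies = {'http': 'http://209.141.62.12:80'}  # placeholder proxy
page = scraper.get(url + '432' + '.htm', timeout=100000)  # 432 = McDonald's, from the comments above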

ands
  • Thank you for taking the time to write up a solution! But as you said, while it works well on the small scale, if I send too many requests using threading (or some other asynchronous technique), I'll get a reCaptcha challenge and will therefore be locked out. I probably should have made this a bigger part of the original post, but it is essential for this project to have the solution be scalable (or have some method/approach to make it so), otherwise, I'm stuck with waiting many hours to gather the data for a single company, or worse, being forced to take frequent breaks from scraping. – Contone Jul 10 '23 at 15:45
  • Yeah, I agree. Unfortunately, Cloudflare is used to prevent you from scraping, and there is no easy way to get around the rate limiting. Maybe you could try combining [cloudscraper](https://github.com/VeNoMouS/cloudscraper/) with [proxies](https://www.scrapehero.com/how-to-rotate-proxies-and-ip-addresses-using-python-3/). – ands Jul 10 '23 at 19:13
  • Since this question has not gotten any attention for around a month, and because it seems like there will not be any better answer for a while, I will accept this answer, at least until a better solution can be found (if one exists). – Contone Aug 07 '23 at 21:52