I've been scraping Glassdoor for company reviews for a while now. Until recently, I could fetch a page with a single line of code:
page = http.get(url + id + ".htm", timeout=1000000, headers=HEADERS)
In fact, I didn't even need the headers argument! This worked fine until I took about a six-month break from the project. When I came back, instead of picking up right where I left off as I expected, every request returned <Response [403]> along with the HTML of a "security page". As a result, I have not been able to get any usable data from the website.
Since 403s while scraping seem to be a common occurrence, I went through many Stack Overflow questions and implemented their suggestions. I tried all of the following changes, and none of them helped:
Adding a 'user-agent' header containing only 'Mozilla/5.0'
Adding a more complicated 'user-agent' header such as:
'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'
Adding a completely filled out header, based on this stack overflow question: Web Scraping getting error (HTTP Error 403: Forbidden) using urllib
Adding only the 'user-agent' and 'Accept-Language' headers, with 'Accept-Language' set to 'en-US;q=0.7,en;q=0.3'
Replacing the httpx library (imported as http in the snippet above) with the requests library
Using requests.Session() and session.get() instead of requests.get(), since a session keeps track of cookies, which lets some websites' blockers be bypassed
And the last, desperate thing I tried was using proxies, based on this question: Web scraping results in 403 Forbidden Error, with free random proxies from https://free-proxy-list.net/. I cycled through about 3 different addresses with varying 'specs'; none of them worked. (A condensed sketch of what these attempts looked like follows this list.)
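To make the list concrete, here is roughly what those attempts looked like, condensed into one snippet. The company id in the URL and the proxy port are made up for illustration; every variant still came back 403 for me:

import requests

# rough, condensed versions of the attempts listed above
# (the company id and the proxy port are made up for illustration)
target = "https://www.glassdoor.com/Reviews/-Reviews-E12345.htm"
UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36')

# 1) user-agent (and Accept-Language) only
resp = requests.get(target, headers={'User-Agent': UA, 'Accept-Language': 'en-US;q=0.7,en;q=0.3'}, timeout=30)
print(resp.status_code)  # 403 for me

# 2) a Session so cookies persist between requests
session = requests.Session()
session.headers.update({'User-Agent': UA})
resp = session.get(target, timeout=30)
print(resp.status_code)  # still 403

# 3) one of the free proxies from free-proxy-list.net
proxies = {'http': 'http://209.141.62.12:80', 'https': 'http://209.141.62.12:80'}
resp = requests.get(target, headers={'User-Agent': UA}, proxies=proxies, timeout=30)
print(resp.status_code)  # still 403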
At this point, I have pretty much no leads on what sets off the red flags (perhaps my IP was flagged?). I have attached my code below in case that is helpful. Again, this behavior is new; just a few months ago everything was working smoothly.
url = "https://www.glassdoor.com/Reviews/-Reviews-E"
HEADERS = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'Accept-Encoding':'gzip, deflate, br',
'Accept-Language':'en-US,en;q=0.9',
'Cache-Control': 'max-age=0',
'Cookie': 'Here is where I copied the cookies from my browser, I looked through it and it contained some info that Might be able to personally identify me so I removed it from the post',
'Sec-Ch-Ua': '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
'Sec-Ch-Ua-Mobile':'?0',
'Sec-Ch-Ua-Platform':"Windows",
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-User':'?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}
if __name__ == "__main__":
id = input("What Glassdoor company would you like to scrape [enter an id]: ")
#getting "403 Forbidden "
session = requests.Session()
session.headers = HEADERS
session.proxies = {'http':'209.141.62.12'}
#this code returns 403 forbidden
#page = http.get(url+ id + ".htm",timeout= 1000000,headers=HEADERS)
page = session.get(url + id + ".htm",timeout = 100000,headers=HEADERS,proxies={'http':'209.141.62.12'})
try:
data = re.findall(r'apolloState":\s*({.+})};', page.text)[0]
except IndexError:
dir = create_log_directory("failed",id)
logging.basicConfig(filename= dir+"info.log",encoding='utf-8',filemode='w',level=logging.DEBUG)
logging.critical("failed")
logging.debug(url + id + ".htm")
logging.debug(page.text)
sys.exit()
For context, you can assume all the logging functions work fine and that the line inside the try will throw an IndexError only if I get back an error or incorrect page. Additionally, I removed the 'Cookie' value from the headers because I want to avoid possibly doxxing myself. It is worth noting, though, that the cookie (and every other header) came directly from my Chrome browser: I visited the site and used the Chrome DevTools Network tab to copy my request headers. For testing purposes, you could replace the cookie with your own.
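In other words, anyone testing this locally would just swap in their own value before running, something like:

# hypothetical placeholder -- paste the Cookie request header copied from your own
# browser's DevTools (Network tab) here before testing
HEADERS['Cookie'] = 'your_cookie_string_from_devtools'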
I really hope someone has an idea of how to fix this; if the code works fine on someone else's machine, then perhaps something about the requests coming from me specifically gives them away. I hope my post is not too wordy or repetitive, but I wanted to show the extent of what I've tried so this post doesn't get closed (either as a duplicate or for 'lack of effort').
Finally, I want to mention that while the code above is not asynchronous, the solution MUST be usable with some form of async. Previously I used aiohttp with asyncio to send multiple GET requests concurrently (roughly the pattern sketched below). While I am completely open to using new packages, I would prefer that any such packages have some async capabilities :) Thanks again to anyone who took the time to read this!
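For reference, this is roughly the aiohttp/asyncio pattern I was using before the break, trimmed down. It reuses the url and HEADERS defined above, and the company ids passed in at the bottom are made up:

import asyncio

import aiohttp

async def fetch(session, company_id):
    # fetch one company's review page and return its status plus HTML
    async with session.get(url + company_id + ".htm") as resp:
        return company_id, resp.status, await resp.text()

async def main(company_ids):
    async with aiohttp.ClientSession(headers=HEADERS) as session:
        return await asyncio.gather(*(fetch(session, cid) for cid in company_ids))

results = asyncio.run(main(["12345", "67890"]))  # made-up ids
for cid, status, _html in results:
    print(cid, status)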