I am trying to scrape company reviews from Glassdoor using BeautifulSoup, but I have failed to extract anything from the site. I am using the following code:

        from requests import get
        from bs4 import BeautifulSoup
        url = "https://www.glassdoor.com/Reviews/The-Wonderful-Company-Reviews-E1005987_P2.htm?sort.sortType=RD&sort.ascending=false"
        response = get(url)
        html_soup = BeautifulSoup(response.text, 'html.parser')
        html_soup

The code above does not extract anything. Instead, the response says 'Bots not allowed'. The output is shown below.

<!DOCTYPE html>
<html><head><title></title><style type="text/css">H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}.line {height: 1px; background-color: #525D76; border: none;}</style> </head><body><h1>HTTP Status 403 - Bots not allowed</h1><div class="line"></div><p><b>type</b> Status report</p><p><b>message</b> <u>Bots not allowed</u></p><p><b>description</b> <u>Access to the specified resource has been forbidden.</u></p><hr class="line"/><h3>Apache Tomcat</h3></body></html>

I am new to web scraping. Can anybody show me how to extract the reviews from Glassdoor?

Sa. S
Is this the whole code? You never print or save anything. If you run this code it will run without error, and you can check the response to your request with print(response.status_code). – Haseeb Ahmed Aug 17 '20 at 17:14

1 Answer

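To get a correct response from the server, set the User-Agent HTTP header. By default, requests identifies itself with a python-requests User-Agent, which this server appears to reject as a bot: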

from requests import get
from bs4 import BeautifulSoup


headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0'}

url = "https://www.glassdoor.com/Reviews/The-Wonderful-Company-Reviews-E1005987_P2.htm?sort.sortType=RD&sort.ascending=false"
response = get(url, headers=headers)
html_soup = BeautifulSoup(response.text, 'html.parser')
print(html_soup)
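
As the comment on the question suggests, it can help to check the response status before parsing. Below is a minimal sketch of that idea, not a definitive implementation: `is_blocked` and `fetch_page` are hypothetical helper names, the 10-second timeout is an arbitrary choice, and the check for the text 'Bots not allowed' assumes the server keeps returning the error page shown in the question.

```python
import requests

# Browser-like User-Agent, copied from the answer above.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) "
                  "Gecko/20100101 Firefox/79.0"
}


def is_blocked(status_code, body):
    """Return True if the response looks like the 'Bots not allowed' page."""
    return status_code == 403 or "Bots not allowed" in body


def fetch_page(url, headers=HEADERS, timeout=10):
    """Fetch url and return its HTML, raising if the request was refused."""
    response = requests.get(url, headers=headers, timeout=timeout)
    if is_blocked(response.status_code, response.text):
        raise RuntimeError(f"Request blocked (HTTP {response.status_code})")
    response.raise_for_status()  # surface any other 4xx/5xx error
    return response.text


if __name__ == "__main__":
    url = ("https://www.glassdoor.com/Reviews/"
           "The-Wonderful-Company-Reviews-E1005987_P2.htm"
           "?sort.sortType=RD&sort.ascending=false")
    html = fetch_page(url)
    print(len(html))  # number of characters of HTML received
```

If `fetch_page` raises, the server has refused the request even with this header, and the returned HTML would not contain any reviews to parse.
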
Andrej Kesely