
I have extracted the items from a particular website and now want to write them to an .xls file.

I expected a full Excel sheet with the headings and rows of information, but I get a sheet with only the headings.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

res = requests.get('https://www.raywhite.com/contact/?type=People&target=people&suburb=Sydney%2C+NSW+2000&radius=50%27%27&firstname=&lastname=&_so=contact')
soup = bs(res.content, 'lxml')

names = []
positions = []
phone = []
emails = []
links = []

# Collect the tags that hold each piece of agent data
nlist = soup.find_all('li', class_='agent-name')
plist = soup.find_all('li', class_='agent-role')
phlist = soup.find_all('li', class_='agent-officenum')
elist = soup.find_all('a', class_='val withicon')

for n1 in nlist:
    names.append(n1.text)
    links.append(n1.get('href'))
for p1 in plist:
    positions.append(p1.text)
for ph1 in phlist:
    phone.append(ph1.text)
for e1 in elist:
    emails.append(e1.get('href'))

df = pd.DataFrame(list(zip(names, positions, phone, emails, links)),
                  columns=['Names', 'Position', 'Phone', 'Email', 'Link'])
df.to_excel(r'C:\Users\laptop\Desktop\RayWhite.xls', sheet_name='MyData2',
            index=False, header=True)

This is what the resulting DataFrame looks like:

(screenshot: the resulting DataFrame is empty apart from the column headings Names, Position, Phone, Email and Link)

  • When I `print(soup)`, it says '429 Too Many Requests'. – Alex Hall Mar 28 '19 at 19:23
  • Yes, and even `print(res.content)` was not displaying the correct output. What could the problem be? I am a beginner to Python programming, so I was trying to practice web scraping and hoping to get some suggestions. – ag2019 Mar 28 '19 at 19:34
  • Possible duplicate of [Returning 403 Forbidden from simple get](https://stackoverflow.com/questions/49542986/returning-403-forbidden-from-simple-get) – ivan_pozdeev Mar 28 '19 at 19:40
  • Okay thanks for suggesting. – ag2019 Mar 28 '19 at 19:55
  • Could you just explain to me the significance of the User-Agent? – ag2019 Mar 28 '19 at 19:57

1 Answer


I tried printing the results of your soup calls, for example nlist = soup.find_all('li', class_='agent-name'), and got back empty lists. The soup calls are not finding any data.
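
For example, a quick sanity check (my own addition, reusing the variable names from your script) confirms that every selector matches zero elements:

# Each length prints as 0 because no matching tags were found
print(len(nlist), len(plist), len(phlist), len(elist))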

Looking further, the request itself is coming back with an error page:

soup = bs(res.content, 'lxml')
print(soup) 

gives:

<html>
<head><title>429 Too Many Requests</title></head>
<body bgcolor="white">
<center><h1>429 Too Many Requests</h1></center>
<hr/><center>nginx</center>
</body>
</html>

It looks like the site is detecting you as a bot and not allowing you to scrape. You can make the request look like it comes from a web browser by following the answer here: Web scraping with Python using BeautifulSoup 429 error

UPDATE:

Adding a user-agent to the request does the trick:

res = requests.get(
    'https://www.raywhite.com/contact/?type=People&target=people&suburb=Sydney%2C+NSW+2000&radius=50%27%27&firstname=&lastname=&_so=contact',
    headers={'User-agent': 'Super Bot 9000'})

You now get the desired output.

(screenshot: the populated DataFrame, with rows of names, positions, phone numbers, emails and links)
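
As a small addition of my own (not part of the original fix), it is worth failing fast if the site still blocks you, so you only ever parse a successful response:

import requests
from bs4 import BeautifulSoup as bs

url = ('https://www.raywhite.com/contact/?type=People&target=people'
       '&suburb=Sydney%2C+NSW+2000&radius=50%27%27'
       '&firstname=&lastname=&_so=contact')

res = requests.get(url, headers={'User-agent': 'Super Bot 9000'})
res.raise_for_status()  # raises requests.HTTPError on 429/403 instead of silently parsing an error page

soup = bs(res.content, 'lxml')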

Some websites reject requests that have no user-agent, and it appears this site does so. Adding a user-agent makes your request look more like it comes from a normal client, so the site allows it through. There isn't really any standard for this; it varies from site to site.
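
If you are curious what gets sent when you omit the header, requests exposes its default value; a minimal sketch (the exact version string below is illustrative):

import requests

# The default header sent when you don't supply one, e.g. 'python-requests/2.21.0'
print(requests.utils.default_user_agent())

# A Session applies a custom value to every request it makes
session = requests.Session()
session.headers.update({'User-agent': 'Super Bot 9000'})
res = session.get('https://www.raywhite.com/contact/')
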

  • Yes, that I have also checked while trying to print the names, locations, ... separately. I am not sure whether the website is not allowing me to scrape, because `print(res.content)` was not giving any correct output either. – ag2019 Mar 28 '19 at 19:31
  • Thanks a lot. Although it was only my first attempt, they were showing error code 429. Could you explain the use of the User agent that you added? I am new to Python programming and just practicing web scraping. – ag2019 Mar 28 '19 at 19:41
  • @AnkurGuha Some websites reject requests that have no user-agent, and it appears this site does so. Adding in a user-agent makes your request look more normal so the site allows it to go through. There isn't really any standard on this or anything, it varies site by site. I didn't know about this until I did a little research myself. Good luck! – Salvatore Mar 28 '19 at 19:44
  • Another thing: what is the significance of the 'Super Bot' value, and does it change from one website to another? – ag2019 Mar 28 '19 at 19:53
  • @AnkurGuha You could put any name you want as the `user-agent`. It just needs to be something other than empty or the default from the `requests` library. That way the website doesn't filter out the request. – Salvatore Mar 28 '19 at 20:09