
I have extracted the items from a particular website and now want to write them to an .xls file.

I expected a full Excel sheet with the headings and rows of information, but I get a sheet with only the headings.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

res = requests.get('https://www.raywhite.com/contact/?type=People&target=people&suburb=Sydney%2C+NSW+2000&radius=50%27%27&firstname=&lastname=&_so=contact')
soup = bs(res.content, 'lxml')

names = []
positions = []
phone = []
emails = []
links = []

# Collect the tags that hold each piece of agent data
nlist = soup.find_all('li', class_='agent-name')
plist = soup.find_all('li', class_='agent-role')
phlist = soup.find_all('li', class_='agent-officenum')
elist = soup.find_all('a', class_='val withicon')

for n1 in nlist:
    names.append(n1.text)
    links.append(n1.get('href'))
for p1 in plist:
    positions.append(p1.text)
for ph1 in phlist:
    phone.append(ph1.text)
for e1 in elist:
    emails.append(e1.get('href'))

df = pd.DataFrame(list(zip(names, positions, phone, emails, links)),
                  columns=['Names', 'Position', 'Phone', 'Email', 'Link'])
df.to_excel(r'C:\Users\laptop\Desktop\RayWhite.xls', sheet_name='MyData2',
            index=False, header=True)

This is what the resulting DataFrame looks like:

(screenshot: the resulting DataFrame is empty apart from the column headings Names, Position, Phone, Email and Link)

  • When I `print(soup)`, it says '429 Too Many Requests'. – Alex Hall Mar 28 '19 at 19:23
  • Yes, and even `print(res.content)` was not displaying the correct output. What could the problem be? I am a beginner to Python programming, so I was trying to practice web scraping and hoping to get some suggestions. – ag2019 Mar 28 '19 at 19:34
  • Possible duplicate of [Returning 403 Forbidden from simple get](https://stackoverflow.com/questions/49542986/returning-403-forbidden-from-simple-get) – ivan_pozdeev Mar 28 '19 at 19:40
  • Okay thanks for suggesting. – ag2019 Mar 28 '19 at 19:55
  • Could you just explain to me the significance of the User-Agent? – ag2019 Mar 28 '19 at 19:57

1 Answer


I tried printing the results of your soup calls, for example nlist = soup.find_all('li', class_='agent-name'), and got back empty lists. The soup calls are not finding any data.
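
For example, a quick sanity check (my own addition, reusing the variable names from your script) confirms that every selector matches zero elements:

# Each length prints as 0 because no matching tags were found
print(len(nlist), len(plist), len(phlist), len(elist))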

Looking further, the request itself is coming back with an error page:

soup = bs(res.content, 'lxml')
print(soup) 

gives:

<html>
<head><title>429 Too Many Requests</title></head>
<body bgcolor="white">
<center><h1>429 Too Many Requests</h1></center>
<hr/><center>nginx</center>
</body>
</html>

It looks like the site is detecting you as a bot and not allowing you to scrape. You can make the request look like it comes from a web browser by following the answer here: Web scraping with Python using BeautifulSoup 429 error

UPDATE:

Adding a user-agent to the request does the trick:

res = requests.get(
    'https://www.raywhite.com/contact/?type=People&target=people&suburb=Sydney%2C+NSW+2000&radius=50%27%27&firstname=&lastname=&_so=contact',
    headers={'User-agent': 'Super Bot 9000'})

You now get the desired output.

(screenshot: the populated DataFrame, with rows of names, positions, phone numbers, emails and links)
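
As a small addition of my own (not part of the original fix), it is worth failing fast if the site still blocks you, so you only ever parse a successful response:

import requests
from bs4 import BeautifulSoup as bs

url = ('https://www.raywhite.com/contact/?type=People&target=people'
       '&suburb=Sydney%2C+NSW+2000&radius=50%27%27'
       '&firstname=&lastname=&_so=contact')

res = requests.get(url, headers={'User-agent': 'Super Bot 9000'})
res.raise_for_status()  # raises requests.HTTPError on 429/403 instead of silently parsing an error page

soup = bs(res.content, 'lxml')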

Some websites reject requests that have no user-agent, and it appears this site does so. Adding a user-agent makes your request look more like it comes from a normal client, so the site allows it through. There isn't really any standard for this; it varies from site to site.
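
If you are curious what gets sent when you omit the header, requests exposes its default value; a minimal sketch (the exact version string below is illustrative):

import requests

# The default header sent when you don't supply one, e.g. 'python-requests/2.21.0'
print(requests.utils.default_user_agent())

# A Session applies a custom value to every request it makes
session = requests.Session()
session.headers.update({'User-agent': 'Super Bot 9000'})
res = session.get('https://www.raywhite.com/contact/')
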

  • Yes, that I have also checked while trying to print the names, locations, ... separately. I am not sure whether the website is not allowing me to scrape, because `print(res.content)` was not giving any correct output either. – ag2019 Mar 28 '19 at 19:31
  • Thanks a lot. Although it was only my first attempt, they were showing error code 429. Could you explain the use of the User agent that you added? I am new to Python programming and just practicing web scraping. – ag2019 Mar 28 '19 at 19:41
  • @AnkurGuha Some websites reject requests that have no user-agent, and it appears this site does so. Adding in a user-agent makes your request look more normal so the site allows it to go through. There isn't really any standard on this or anything, it varies site by site. I didn't know about this until I did a little research myself. Good luck! – Salvatore Mar 28 '19 at 19:44
  • Another thing: what is the significance of the 'Super Bot' value, and does it change from one website to another? – ag2019 Mar 28 '19 at 19:53
  • @AnkurGuha You could put any name you want as the `user-agent`. It just needs to be something other than empty or the default from the `requests` library. That way the website doesn't filter out the request. – Salvatore Mar 28 '19 at 20:09