First, I have to say that I'm quite new to web scraping with Python. I'm trying to scrape data using these lines of code:
import requests
from bs4 import BeautifulSoup
baseurl = 'https://name_of_the_website.com'
html_page = requests.get(baseurl).text
soup = BeautifulSoup(html_page, 'html.parser')
print(soup)
As output I do not get the expected HTML page, but a different HTML page that says: "Misbehaving Content Scraper. Please use robots.txt. Your IP has been rate limited."
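If I understand correctly, the "Please use robots.txt" part refers to the crawling rules the site publishes at /robots.txt. A minimal sketch of how I believe these can be checked with the standard library (the path I test at the end is just a placeholder):

from urllib.robotparser import RobotFileParser

baseurl = 'https://name_of_the_website.com'  # same placeholder URL as above

# Fetch and parse the site's robots.txt rules
rp = RobotFileParser()
rp.set_url(baseurl + '/robots.txt')
rp.read()

# True if a generic crawler ('*') is allowed to fetch this URL
print(rp.can_fetch('*', baseurl + '/some/page'))  # '/some/page' is a placeholder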
To check what was happening, I wrote:
try:
    page_response = requests.get(baseurl, timeout=5)
    if page_response.status_code == 200:
        # Reuse the response already received instead of requesting the page again
        html_page = page_response.text
        soup = BeautifulSoup(html_page, 'html.parser')
    else:
        print(page_response.status_code)
except requests.Timeout as e:
    print(str(e))
Running this prints 429 (Too Many Requests).
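From what I've read, a 429 response often includes a Retry-After header, and the usual handling is to wait and retry with increasing delays. A minimal sketch of what I think that would look like (the retry count and delays are guesses, and it assumes Retry-After is given in seconds rather than as a date):

import time
import requests

def get_with_backoff(url, max_retries=5):
    delay = 1  # initial wait in seconds, an arbitrary starting point
    for attempt in range(max_retries):
        response = requests.get(url, timeout=5)
        if response.status_code != 429:
            return response
        # Prefer the server's own hint if it sends one
        wait = int(response.headers.get('Retry-After', delay))
        time.sleep(wait)
        delay *= 2  # exponential back-off between attempts
    raise RuntimeError('still rate limited after {} attempts'.format(max_retries))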
What can I do to handle this problem? Does it mean I cannot print the HTML of the page, and does it prevent me from scraping any content from the page? Should I rotate the IP address?
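On the last point, my understanding is that requests can route traffic through a proxy via its proxies argument, so rotating IPs would mean cycling through a list of proxy addresses. A rough sketch of that idea (the proxy URLs below are placeholders, not working proxies):

import itertools
import requests

# Placeholder proxy addresses; real ones would come from a proxy provider
proxy_pool = itertools.cycle([
    'http://111.111.111.111:8080',
    'http://222.222.222.222:8080',
])

def get_via_next_proxy(url):
    proxy = next(proxy_pool)
    # Route both HTTP and HTTPS traffic through the chosen proxy
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=5)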