
First, I have to say that I'm quite new to web scraping with Python. I'm trying to scrape data using these lines of code:

import requests
from bs4 import BeautifulSoup

baseurl = 'https://name_of_the_website.com'
html_page = requests.get(baseurl).text
soup = BeautifulSoup(html_page, 'html.parser')
print(soup)

As output I do not get the expected HTML page but a different HTML page that says: "Misbehaving Content Scraper. Please use robots.txt. Your IP has been rate limited."

To check the problem I wrote:

try:
    page_response = requests.get(baseurl, timeout=5)
    if page_response.status_code == 200:
        html_page = requests.get(baseurl).text
        soup = BeautifulSoup(html_page, 'html.parser')
    else:
        print(page_response.status_code)
except requests.Timeout as e:
    print(str(e))

Then I get 429 (too many requests).

What can I do to handle this problem? Does it mean I cannot print the HTML of the page, and does it prevent me from scraping any of its content? Should I rotate the IP address?

Giorgetto
  • 429 indicates that you ran your code too many times, not that there's anything wrong with the code per se. – tripleee Aug 01 '18 at 16:58
  • Possible duplicate of [How to avoid HTTP error 429 (Too Many Requests) python](https://stackoverflow.com/questions/22786068/how-to-avoid-http-error-429-too-many-requests-python) – tripleee Aug 01 '18 at 16:58
  • Thanks for the answer, but actually I do not get an error in the Python command line; I get '429' as the output from the second block of code. Is it the same thing? What do you mean when you say I ran the code too many times? I wasn't able to print the HTML even the first time I ran it. – Giorgetto Aug 01 '18 at 17:21

1 Answer


If you are only hitting the page once and getting a 429, it's probably not that you are hitting them too much. You can't be sure the 429 error is accurate; it's simply what their web server returned. I've seen pages return a 404 response code when the page was fine, and a 200 response code on genuinely missing pages; it was just a misconfigured server. They may simply return 429 to any bot. Try changing your User-Agent to Firefox, Chrome, or "Robot Web Scraper 9000" and see what you get. Like this:

requests.get(baseurl, headers={'User-Agent': 'Super Bot Power Level Over 9000'})

to declare yourself as a bot, or

requests.get(baseurl, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})

if you wish to mimic a browser more closely. Note that all the version numbers in that string were current at the time of this writing; you may need later ones. Just use the user agent of the browser you actually use; this page will tell you what that is:

https://www.whatismybrowser.com/detect/what-is-my-user-agent

Some sites return cleaner, more searchable markup if you just say you are a bot; for others it's the opposite. It's basically the wild west, so you have to try different things.
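For example, here is a quick, untested sketch that tries both of the User-Agent strings above against the URL from your question and prints the status code each one gets back:

import requests

baseurl = 'https://name_of_the_website.com'  # the placeholder URL from the question

user_agents = [
    'Super Bot Power Level Over 9000',  # declare yourself as a bot
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',  # look like a browser
]

for ua in user_agents:
    response = requests.get(baseurl, headers={'User-Agent': ua}, timeout=5)
    print(response.status_code, '<-', ua[:40])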

Another pro tip: you may have to write your code to keep a 'cookie jar', that is, a way to accept and send back cookies. Usually it is just an extra line in your request, but I'll leave the details for another Stack Overflow question :)
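With the requests library that usually just means using a Session, which holds the cookie jar for you. A rough sketch, again assuming the baseurl from your question:

import requests

baseurl = 'https://name_of_the_website.com'  # from the question

session = requests.Session()  # the Session object keeps the cookie jar
session.headers.update({'User-Agent': 'Super Bot Power Level Over 9000'})

first = session.get(baseurl, timeout=5)
print(first.status_code, session.cookies.get_dict())  # cookies the site handed us

# Any cookies set above are sent back automatically on later requests
second = session.get(baseurl, timeout=5)
print(second.status_code)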

If you are indeed hitting them a lot, you need to sleep between calls. The response is a server-side decision completely controlled by them. You will also want to look at how your code interacts with robots.txt; that's a file usually at the root of the web server with the rules it would like your spider to follow.
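A rough sketch of the sleeping idea (the page list is made up, and not every server sends a Retry-After header, so treat that part as optional):

import time
import requests

baseurl = 'https://name_of_the_website.com'  # from the question
pages = [baseurl + '/page1', baseurl + '/page2']  # made-up example URLs

for url in pages:
    response = requests.get(url, headers={'User-Agent': 'Super Bot Power Level Over 9000'}, timeout=5)
    if response.status_code == 429:
        # Some servers say how long to back off; fall back to 30s if they don't
        retry_after = response.headers.get('Retry-After')
        time.sleep(int(retry_after) if retry_after and retry_after.isdigit() else 30)
    else:
        print(url, response.status_code)
    time.sleep(2)  # be polite: pause between every request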

You can read about that here: Parsing Robots.txt in python
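The standard library also has urllib.robotparser if you want to check the rules programmatically. A minimal sketch (the paths are just examples):

from urllib import robotparser

baseurl = 'https://name_of_the_website.com'  # from the question

rp = robotparser.RobotFileParser()
rp.set_url(baseurl + '/robots.txt')
rp.read()

# Is this user agent allowed to fetch this (example) path?
print(rp.can_fetch('Super Bot Power Level Over 9000', baseurl + '/some/page'))

# Crawl-delay the site asks for, if any (None when not declared)
print(rp.crawl_delay('Super Bot Power Level Over 9000'))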

Spidering the web is fun and challenging; just remember that you could be blocked at any time, by any site, for any reason. You are their guest, so tread nicely :)

sniperd
  • Thanks for the answer, I think you hit the point. However, I did not succeed either with the User-Agent spoofing you suggested or with an attempt at IP rotation. How can I tell whether I have been blocked? – Giorgetto Aug 01 '18 at 20:36
  • Sites can really only block by IP address, or they can try to hand you a `cookie` and if you don't accept it they assume you are a bot and will sometimes reject your request. However, most sites _want_ to be spidered, that's how they generate traffic. You could also always contact the webmaster (if it's a small site). If you've found my answer helpful, please accept it :) and welcome to StackOverflow! :) – sniperd Aug 02 '18 at 12:49
  • @Giorgetto, I am stuck with the same problem. Did you find any way to get the HTML page into Python? – Mischief_Monkey May 23 '20 at 15:50