
I just got started with the urllib module. I'm trying to scrape products from supermarkets, and there's a website that always seems to respond with an HTTP Error 429: Too Many Requests. I already did a bit of research on Stack Overflow and no one seems to have the same problem. My code is as simple as it can get:

>>> import urllib.request
>>> resp = urllib.request.urlopen("https://shop.coles.com.au/a/a-national/product/head-shoulders-shampoo-conditioner-2in1-deep-clean")
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    resp = urllib.request.urlopen("https://shop.coles.com.au/a/a-national/product/head-shoulders-shampoo-conditioner-2in1-deep-clean")
  File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
  File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
response = meth(req, response)
  File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 640, in http_response
'http', request, response, code, msg, hdrs)
  File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 568, in error
return self._call_chain(*args)
  File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
  File "C:\Users\thank\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 648, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 429: Too Many Requests

I've also tried modifying the user agent as this answer suggests, but the result is still the same.

Can someone explain which default settings inside the urllib module may cause the problem? Or is it because the website blocks bots? Other product pages of the website don't work either.

Mike Pham
  • 429 means you're hitting their endpoints more often than they would like. Sometimes the response will tell you how long to wait before trying again (e.g. in a `Retry-After` header), so I'd start there. – Shadow Feb 17 '19 at 02:01
  • If you're just getting started with learning how to scrape web sites you should not be accessing public sites. You should experiment, and learn, by scraping your own web server. Then take some time to learn about mechanisms like the `robots.txt` file and scraping best practices. – Kurtis Rader Feb 17 '19 at 20:37

1 Answer


429 is the server asking you to slow down. Basically, the web server thinks you are trying to spam or scrape it, and it doesn't like that. Generally you should honor the server's wishes: if the 429 response tells you to try again after some time (typically via a `Retry-After` header), you should follow it.
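As a minimal sketch of honoring that, you could catch the `HTTPError`, read the `Retry-After` header if present, and back off before retrying (the function name and retry counts here are illustrative, not from the question):

```python
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url, max_retries=3):
    """Fetch a URL, waiting out 429 responses before retrying."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise
            # Honor Retry-After if the server sent it; otherwise
            # fall back to a simple exponential backoff.
            wait = int(e.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
    raise RuntimeError("still rate-limited after %d retries" % max_retries)
```

Note that even with backoff, a site that rate-limits aggressively may keep returning 429 if you request pages too quickly overall, so space out your requests as well.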

If you feel the server is blocking you wrongly, you can make sure that your request is *similar* to a request generated by a user in a browser, which means including the user agent and all the other headers a regular browser would send with the request. If the server still sends you 429 despite that, most probably it has blocked your IP temporarily or permanently. In that case you would have to look at how to scrape through multiple IPs (and reconsider whether the site permits scraping at all).
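A sketch of sending browser-like headers with `urllib.request.Request` (the exact header values below are just examples of what a real browser might send, not values known to satisfy this particular site):

```python
import urllib.request

def browser_like_request(url):
    """Build a Request carrying headers a regular browser would send."""
    return urllib.request.Request(url, headers={
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/72.0.3626.96 Safari/537.36"),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-AU,en;q=0.9",
    })

# Then fetch the page with:
#   html = urllib.request.urlopen(browser_like_request(url)).read()
```

If the site inspects more than headers (cookies, JavaScript challenges), this alone will not be enough.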

Biswanath