2

I'm trying to download the HTML of a page (http://www.guangxindai.com in this case) but I'm getting back an error 403. Here is my code:

import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
f = opener.open("http://www.guangxindai.com")
f.read()

but I get the following error response:

Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    f = opener.open("http://www.guangxindai.com")
  File "C:\Python33\lib\urllib\request.py", line 475, in open
    response = meth(req, response)
  File "C:\Python33\lib\urllib\request.py", line 587, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python33\lib\urllib\request.py", line 513, in error
    return self._call_chain(*args)
  File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 595, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

I have tried different request headers, but I still cannot get a correct response. I can view the site in a browser, which seems strange to me. I guess the site uses some method to block web spiders. Does anyone know what is happening? How can I get the HTML of the page correctly?

zhangzhai
  • Well, from the information provided we can only infer what is in the RFC: `403 Forbidden The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.` (see [here](http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html)) – jlnabais Oct 08 '15 at 13:20
  • However, Wikipedia ([here](https://en.wikipedia.org/wiki/HTTP_403)) has a list of "subcodes"; I'm not sure whether urllib lets you check those subcodes. – jlnabais Oct 08 '15 at 13:20

2 Answers

2

I was having the same problem as you, and I found the answer in this link.

The answer provided by Stefano Sanfilippo is quite simple and worked for me:

from urllib.request import Request, urlopen

url_request = Request("http://www.guangxindai.com", 
                      headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(url_request).read()
Community
-2

If your aim is to read the HTML of the page, you can use the following code. It worked for me on Python 2.7:

import urllib
f = urllib.urlopen("http://www.guangxindai.com")
f.read()
Mahesh B
  • Even if this code makes the example functional, I think @zhangzhai wanted an explanation for the reason he got the 403. – jlnabais Oct 08 '15 at 13:24
  • When I run that on 2.7.10 I get a page back, but it's just the 403 error page. It has a line like this: `` – wpercy Oct 08 '15 at 13:26
  • Thanks for your response. I think the answer may not be so simple. I have tried different request headers, but I still cannot get a correct response. I can view the site in a browser, which seems strange to me. – zhangzhai Oct 08 '15 at 13:38
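
Since the comments point out that a server may describe the reason for a 403 in the response body, one way to investigate is to read that body instead of letting the exception end the script. The following is only a minimal sketch for Python 3: `fetch_with_diagnostics` is a hypothetical helper name (not part of urllib), the default `Mozilla/5.0` User-Agent is taken from the code above, and the 10-second timeout is an arbitrary assumption. It will not bypass whatever blocking the site uses; it only makes the refusal easier to inspect.

```python
import urllib.request
from urllib.error import HTTPError


def fetch_with_diagnostics(url, user_agent="Mozilla/5.0", timeout=10):
    """Hypothetical helper: return (status, headers, body) for a GET request.

    Error responses such as 403 are returned rather than raised, so the
    body the server sent with the refusal can be examined.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, dict(response.headers), response.read()
    except HTTPError as error:
        # HTTPError doubles as a response object: its code, headers and
        # body are those of the error page the server returned.
        return error.code, dict(error.headers), error.read()
```

Calling `fetch_with_diagnostics("http://www.guangxindai.com")` and printing the returned status, headers, and the first few hundred bytes of the body should show whatever the server sends along with the 403, which may or may not explain the refusal.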