
I'm writing a Python program to study the HTML source code used in different countries. I'm testing in a UNIX shell. The code I have so far works fine, except that I'm getting HTTP Error 403: Forbidden. By testing it line by line, I know it has something to do with line 27: `url3response = urllib2.urlopen(url3)` followed by `url3Content = url3response.read()`.

Every other URL response works fine except this one. Any ideas???

Here is the text file I'm reading from (top5_US.txt):

http://www.caltech.edu
http://www.stanford.edu
http://www.harvard.edu
http://www.mit.edu
http://www.princeton.edu

And here is my code:

import urllib2

#Open desired text file (in this case, "top5_US.txt")
text_file = open('top5_US.txt', 'r')

#Read each line of the text file
firstLine = text_file.readline().strip()
secondLine = text_file.readline().strip()
thirdLine = text_file.readline().strip()
fourthLine = text_file.readline().strip()
fifthLine = text_file.readline().strip()

#Turn each line into a URL variable
url1 = firstLine
url2 = secondLine
url3 = thirdLine
url4 = fourthLine
url5 = fifthLine

#Read URL 1, get content, and store it in a variable.
url1response = urllib2.urlopen(url1)
url1Content = url1response.read()

#Read URL 2, get content, and store it in a variable.
url2response = urllib2.urlopen(url2)
url2Content = url2response.read()

#Read URL 3, get content, and store it in a variable.
url3response = urllib2.urlopen(url3)
url3Content = url3response.read()

#Read URL 4, get content, and store it in a variable.
url4response = urllib2.urlopen(url4)
url4Content = url4response.read()

#Read URL 5, get content, and store it in a variable.
url5response = urllib2.urlopen(url5)
url5Content = url5response.read()

text_file.close()
    Probably not the problem, but: why aren't you using lists? `url5response`? Why not `responses[4]`? – jonrsharpe Feb 14 '17 at 22:06
  • I did this simply for readability on my end, but you make a good point. It certainly would shorten this code quite a bit. – shadewolf Feb 14 '17 at 22:08
  • 2
    `403 Forbidden` is probably an answer from a firewall between you and `harvard.edu`? Does `curl http://www.harvard.edu` work from the command line? – Andomar Feb 14 '17 at 22:09
  • Please try to be specific enough to identify your individual question in your question's title. The original title of *"What is this error message, and why is it happening?"* could probably apply to 50% of all StackOverflow questions -- it doesn't specify the error message itself *or* what you're doing to get it. Part of the point of a good title is to help others identify when they have the same problem, and to do that requires specificity. – Charles Duffy Feb 14 '17 at 22:28
  • This is a duplicate of http://stackoverflow.com/questions/3336549/pythons-urllib2-why-do-i-get-error-403-when-i-urlopen-a-wikipedia-page – jgritty Feb 15 '17 at 00:18
  • If you catch the exception and print the page, you'll see: `Access denied | www.harvard.edu used Cloudflare to restrict access` – jgritty Feb 15 '17 at 00:25
  • You might find this code useful: http://pastebin.com/a7cjRbwR – jgritty Feb 15 '17 at 0:27
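As the first comment suggests, the five numbered variables can be collapsed into a list built in a loop. A minimal sketch of that refactor; it uses an in-memory `io.StringIO` stand-in for the file (mirroring the contents of `top5_US.txt`) so it runs without the file or network access:

```python
import io

# Stand-in for open('top5_US.txt', 'r'); contents mirror the question's file.
text_file = io.StringIO(
    "http://www.caltech.edu\n"
    "http://www.stanford.edu\n"
    "http://www.harvard.edu\n"
    "http://www.mit.edu\n"
    "http://www.princeton.edu\n"
)

# One list replaces firstLine..fifthLine and url1..url5.
urls = [line.strip() for line in text_file if line.strip()]

print(len(urls))  # 5
print(urls[2])    # http://www.harvard.edu
```

Fetching then becomes a loop over `urls` (e.g. `contents = [urllib2.urlopen(u).read() for u in urls]`), so adding a sixth URL to the text file requires no code change.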

3 Answers


A 403 Forbidden error means that you do not have the necessary permissions to see or download the page. This particular site may have some sort of DDoS-prevention layer that blocks scripted access.

clubby789

It looks like the Python user-agent is blocked.

$ curl -D - http://www.harvard.edu -o /dev/null
HTTP/1.1 200 Ok
...
$ curl -H 'User-Agent: Python-urllib/2.7' -D - http://www.harvard.edu -o /dev/null
HTTP/1.1 403 Forbidden
...

Obviously, user-agent spoofing is a possible solution. However, I would consider it unethical to simply spoof user agents without at least parsing the robots.txt file first, and obeying it.

Please be conscientious when spidering. See: How to be a good citizen when crawling web sites
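Checking robots.txt is mechanical with the standard library. A sketch using Python 3's `urllib.robotparser` (the module is named `robotparser` in Python 2); the rules fed in here are invented for illustration only, whereas a real crawler would fetch the target site's own file:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Illustrative rules; a real crawler would instead do
# rp.set_url('http://www.example.com/robots.txt'); rp.read()
rp.parse([
    'User-agent: *',
    'Disallow: /admin/',
])

print(rp.can_fetch('Mozilla/5.0', 'http://www.example.com/'))         # True
print(rp.can_fetch('Mozilla/5.0', 'http://www.example.com/admin/x'))  # False
```

Calling `can_fetch(user_agent, url)` before each `urlopen` keeps the crawler within the rules the site actually publishes.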

The body of the 403 response has the following message:

The owner of this website (www.harvard.edu) has banned your access based on your browser's signature (3313e52986a2470a-ua48).

Dietrich Epp

Like @Jammy Dodger said, you have to provide a user agent:

request = urllib2.Request(
    "http://www.harvard.edu", 
    headers = {'User-Agent': 'Mozilla/5.0'})
print(urllib2.urlopen(request).read())

But the site seems to be very JavaScript-centric, so you may not be able to do much with the reply without a full-fledged HTML/JavaScript client.
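For reference, the same request in Python 3, where `urllib2.Request` became `urllib.request.Request`. This sketch only builds the request and inspects the header it would send, so it runs without network access; note that `Request` stores header names capitalised via `str.capitalize()`, so the key to query back is `'User-agent'`:

```python
import urllib.request

request = urllib.request.Request(
    'http://www.harvard.edu',
    headers={'User-Agent': 'Mozilla/5.0'},
)

# Request normalises header names, so query with 'User-agent'.
print(request.get_header('User-agent'))  # Mozilla/5.0
# Actually sending it would be: urllib.request.urlopen(request).read()
```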

Andomar