
I'm writing a Python program to study the HTML source code used in different countries. I'm testing in a UNIX shell. The code I have so far works fine, except that I'm getting HTTP Error 403: Forbidden. By testing it line by line, I know it has something to do with line 27: `url3response = urllib2.urlopen(url3)` followed by `url3Content = url3response.read()`.

Every other URL response works fine except this one. Any ideas???

Here is the text file I'm reading from (top5_US.txt):

http://www.caltech.edu
http://www.stanford.edu
http://www.harvard.edu
http://www.mit.edu
http://www.princeton.edu

And here is my code:

import urllib2

#Open desired text file (in this case, "top5_US.txt")
text_file = open('top5_US.txt', 'r')

#Read each line of the text file
firstLine = text_file.readline().strip()
secondLine = text_file.readline().strip()
thirdLine = text_file.readline().strip()
fourthLine = text_file.readline().strip()
fifthLine = text_file.readline().strip()

#Turn each line into a URL variable
url1 = firstLine
url2 = secondLine
url3 = thirdLine
url4 = fourthLine
url5 = fifthLine

#Read URL 1, get content, and store it in a variable.
url1response = urllib2.urlopen(url1)
url1Content = url1response.read()

#Read URL 2, get content, and store it in a variable.
url2response = urllib2.urlopen(url2)
url2Content = url2response.read()

#Read URL 3, get content, and store it in a variable.
url3response = urllib2.urlopen(url3)
url3Content = url3response.read()

#Read URL 4, get content, and store it in a variable.
url4response = urllib2.urlopen(url4)
url4Content = url4response.read()

#Read URL 5, get content, and store it in a variable.
url5response = urllib2.urlopen(url5)
url5Content = url5response.read()

text_file.close()
    Probably not the problem, but: why aren't you using lists? `url5response`? Why not `responses[4]`? – jonrsharpe Feb 14 '17 at 22:06
  • I did this simply for readability on my end, but you make a good point. It certainly would shorten this code quite a bit. – shadewolf Feb 14 '17 at 22:08
  • 2
    `403 Forbidden` is probably an answer from a firewall between you and `harvard.edu`? Does `curl http://www.harvard.edu` work from the command line? – Andomar Feb 14 '17 at 22:09
  • Please try to be specific enough to identify your individual question in your question's title. The original title of *"What is this error message, and why is it happening?"* could probably apply to 50% of all StackOverflow questions -- it doesn't specify the error message itself *or* what you're doing to get it. Part of the point of a good title is to help others identify when they have the same problem, and to do that requires specificity. – Charles Duffy Feb 14 '17 at 22:28
  • This is a duplicate of http://stackoverflow.com/questions/3336549/pythons-urllib2-why-do-i-get-error-403-when-i-urlopen-a-wikipedia-page – jgritty Feb 15 '17 at 00:18
  • If you catch the exception and print the page, you'll see: `Access denied | www.harvard.edu used Cloudflare to restrict access` – jgritty Feb 15 '17 at 00:25
  • You might find this code useful: http://pastebin.com/a7cjRbwR – jgritty Feb 15 '17 at 0:27
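As the first comment suggests, the five numbered variables can be collapsed into a list built in a loop. A minimal sketch of that refactor; it uses an in-memory `io.StringIO` stand-in for the file (mirroring the contents of `top5_US.txt`) so it runs without the file or network access:

```python
import io

# Stand-in for open('top5_US.txt', 'r'); contents mirror the question's file.
text_file = io.StringIO(
    "http://www.caltech.edu\n"
    "http://www.stanford.edu\n"
    "http://www.harvard.edu\n"
    "http://www.mit.edu\n"
    "http://www.princeton.edu\n"
)

# One list replaces firstLine..fifthLine and url1..url5.
urls = [line.strip() for line in text_file if line.strip()]

print(len(urls))  # 5
print(urls[2])    # http://www.harvard.edu
```

Fetching then becomes a loop over `urls` (e.g. `contents = [urllib2.urlopen(u).read() for u in urls]`), so adding a sixth URL to the text file requires no code change.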

3 Answers


A 403 Forbidden error means that you do not have the necessary permissions to see or download the page. This particular site may have some sort of DDoS-prevention layer that blocks scripted access.

clubby789

It looks like the Python user-agent is blocked.

$ curl -D - http://www.harvard.edu -o /dev/null
HTTP/1.1 200 Ok
...
$ curl -H 'User-Agent: Python-urllib/2.7' -D - http://www.harvard.edu -o /dev/null
HTTP/1.1 403 Forbidden
...

Obviously, user-agent spoofing is a possible solution. However, I would consider it unethical to simply spoof user agents without at least parsing the robots.txt file first, and obeying it.

Please be conscientious when spidering. See: How to be a good citizen when crawling web sites
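Checking robots.txt is mechanical with the standard library. A sketch using Python 3's `urllib.robotparser` (the module is named `robotparser` in Python 2); the rules fed in here are invented for illustration only, whereas a real crawler would fetch the target site's own file:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Illustrative rules; a real crawler would instead do
# rp.set_url('http://www.example.com/robots.txt'); rp.read()
rp.parse([
    'User-agent: *',
    'Disallow: /admin/',
])

print(rp.can_fetch('Mozilla/5.0', 'http://www.example.com/'))         # True
print(rp.can_fetch('Mozilla/5.0', 'http://www.example.com/admin/x'))  # False
```

Calling `can_fetch(user_agent, url)` before each `urlopen` keeps the crawler within the rules the site actually publishes.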

The body of the 403 response has the following message:

The owner of this website (www.harvard.edu) has banned your access based on your browser's signature (3313e52986a2470a-ua48).

Dietrich Epp

Like @Jammy Dodger said, you have to provide a user agent:

request = urllib2.Request(
    "http://www.harvard.edu", 
    headers = {'User-Agent': 'Mozilla/5.0'})
print(urllib2.urlopen(request).read())

But the site seems to be very JavaScript-centric, so you may not be able to do much with the reply without a full-fledged HTML/JavaScript client.
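For reference, the same request in Python 3, where `urllib2.Request` became `urllib.request.Request`. This sketch only builds the request and inspects the header it would send, so it runs without network access; note that `Request` stores header names capitalised via `str.capitalize()`, so the key to query back is `'User-agent'`:

```python
import urllib.request

request = urllib.request.Request(
    'http://www.harvard.edu',
    headers={'User-Agent': 'Mozilla/5.0'},
)

# Request normalises header names, so query with 'User-agent'.
print(request.get_header('User-agent'))  # Mozilla/5.0
# Actually sending it would be: urllib.request.urlopen(request).read()
```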

Andomar