I'm using urllib2 to request URLs and read their contents, but it fails for some URLs. Consider these commands:

import urllib2

# No problem with this URL
urllib2.urlopen('http://www.huffingtonpost.com/2014/07/19/todd-akin-slavery_n_5602083.html')
# This one produces an error
urllib2.urlopen('http://www.foxnews.com/us/2014/07/19/cartels-suspected-as-high-caliber-gunfire-sends-border-patrol-scrambling-on-rio/')

The second URL produced an error like this:

Traceback (most recent call last):
  File "D:/Developer Center/Republishan/republishan2/republishan2/test.py", line 306, in <module>
    urllib2.urlopen('http://www.foxnews.com/us/2014/07/19/cartels-suspected-as-high-caliber-gunfire-sends-border-patrol-scrambling-on-rio/')
  File "C:\Python27\lib\urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 410, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found

What's the problem with this?

ehsan shirzadi
    This answer worked with the url you provided, using urllib2 and changing the user-agent: http://stackoverflow.com/a/5196160/2679935 – julienc Jul 20 '14 at 08:58

1 Answer

I think the site is checking the User-Agent and/or other headers. By default urllib2 identifies itself with a User-Agent of the form Python-urllib/2.7, which some sites reject.
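To see which User-Agent urllib2 sends when none is specified, one can inspect the default headers of a freshly built opener. This is a minimal sketch; the helper name default_user_agent is hypothetical, and the try/except import is only there so the same snippet also runs on Python 3, where the module is urllib.request.

```python
try:
    import urllib2  # Python 2, as used in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent module


def default_user_agent():
    """Return the User-Agent a freshly built opener adds to every request."""
    opener = urllib2.build_opener()
    # addheaders is a list of (name, value) tuples sent with each request.
    return dict(opener.addheaders).get("User-agent")


print(default_user_agent())  # e.g. Python-urllib/2.7 on Python 2.7
```

A server that filters on this header can tell such requests apart from browser traffic, which would be consistent with one URL working and the other returning 404.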

You can set a User-Agent manually.
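As a rough sketch of setting the header by hand with urllib2: the names build_request, fetch and USER_AGENT are hypothetical, and the "Mozilla/5.0" value is only an illustrative browser-like string, not something the site is known to require. The import fallback lets the same code run on Python 3.

```python
try:
    import urllib2  # Python 2, as used in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent module

# Assumption: an illustrative browser-like value; other strings may work too.
USER_AGENT = "Mozilla/5.0"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib2.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT, timeout=10):
    """Open the URL with the custom header and return the raw response body."""
    response = urllib2.urlopen(build_request(url, user_agent), timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()
```

Calling fetch() with the second URL from the question should then return the page contents, provided the server accepts the header; if it still rejects the request, urlopen raises HTTPError as before.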

The Requests library sets its own User-Agent (python-requests/<version>) automatically.

However, remember that the Requests User-Agent may also be blocked by some sites.
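If a site does block the default Requests User-Agent, the header can be overridden through the headers argument of requests.get. This is a sketch under assumptions: build_headers, fetch_text and BROWSER_UA are hypothetical names, and the browser string is just an example value.

```python
import requests

# Assumption: an example browser-like string, chosen only for illustration.
BROWSER_UA = "Mozilla/5.0 (Windows NT 6.1; rv:30.0) Gecko/20100101 Firefox/30.0"


def build_headers(user_agent=BROWSER_UA):
    """Return the extra headers to send with each request."""
    return {"User-Agent": user_agent}


def fetch_text(url, user_agent=BROWSER_UA, timeout=10):
    """GET the URL with a custom User-Agent and return the decoded body."""
    r = requests.get(url, headers=build_headers(user_agent), timeout=timeout)
    r.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return r.text
```

Headers passed this way replace the library defaults of the same name for that request only.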

Try this; it works for me. You need to install the requests module first:

pip install requests

Then

import requests

r = requests.get("http://www.foxnews.com/us/2014/07/19/cartels-suspected-as-high-caliber-gunfire-sends-border-patrol-scrambling-on-rio/")

print(r.text)

urllib2 is harder to use and requires more code. Requests is simpler and more in line with the Python philosophy that code should be beautiful!

Wally