42

I have the following Python script, and it works well.

import urllib2

url = 'http://abc.com' # write the url here

usock = urllib2.urlopen(url)
data = usock.read()
usock.close()

print data

However, some of the URLs I give it may redirect two or more times. How can I have Python wait for the redirects to complete before loading the data? For instance, when using the above code with

http://www.google.com/search?hl=en&q=KEYWORD&btnI=1

which is the equivalent of hitting the "I'm Feeling Lucky" button on a Google search, I get:

>>> url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
>>> usock = urllib2.urlopen(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>> 

I've tried the (url, data, timeout) arguments; however, I am unsure what to put there.

EDIT: I actually found out that if I don't follow the redirect and just read the headers of the first response, I can grab the location of the next redirect and use that as my final link.
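
For anyone who wants to try the approach from the EDIT, here is a minimal sketch, assuming Python 3 (where urllib2 was split into urllib.request and http.client). The helper names redirect_target and next_location are made up for illustration, not part of any library. http.client does not follow redirects by itself, so the Location header of the first response can be read directly.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Status codes that normally carry a Location header.
REDIRECT_CODES = (301, 302, 303, 307, 308)


def redirect_target(request_url, status, location):
    """Return the absolute redirect URL, or None if this is not a redirect."""
    if status in REDIRECT_CODES and location:
        # Location may be relative, so resolve it against the request URL.
        return urljoin(request_url, location)
    return None


def next_location(url, timeout=10):
    """Send one HEAD request (hypothetical helper) and return the next hop, if any."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return redirect_target(url, response.status, response.getheader("Location"))
    finally:
        conn.close()
```

Calling next_location repeatedly until it returns None walks the chain one hop at a time. Some servers answer HEAD differently from GET, so treat the result as a hint rather than a guarantee.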

Cripto

2 Answers

29

Use requests, as the other answer states; here is an example. The final URL after any redirects will be in r.url. In the example below, http is redirected to https.

For HEAD:

In [1]: import requests
   ...: r = requests.head('http://github.com', allow_redirects=True)
   ...: r.url

Out[1]: 'https://github.com/'

For GET:

In [1]: import requests
   ...: r = requests.get('http://github.com')
   ...: r.url

Out[1]: 'https://github.com/'

Note that for HEAD you have to specify allow_redirects. If you don't, you can still get the target from the headers, but this is not advised.

In [1]: import requests

In [2]: r = requests.head('http://github.com')

In [3]: r.headers.get('location')
Out[3]: 'https://github.com/'

To download the page you will need GET; you can then access the page body using r.content.

Michael Delgado
Glen Thompson
  • 1
    Why is getting it by the header not advised? – Nightforce2 Oct 16 '18 at 19:41
  • I know this wasn't that long ago, but it feels like it. I think I ran a check and found it to be less reliable; the docs might also say so. If you test it yourself, let me know what you find. – Glen Thompson Oct 18 '18 at 16:22
26

You might be better off with the Requests library, which has a better API for controlling redirect handling:

https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history

Requests:

https://pypi.org/project/requests/ (urllib replacement for humans)
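
To make the pointer concrete, here is a rough sketch of what the redirection-and-history feature looks like in use. The function names redirect_chain and fetch are my own, chosen for illustration; r.history, r.url, r.status_code and r.content are the attributes described in the Requests quickstart.

```python
import requests


def redirect_chain(response):
    """Return [(status_code, url), ...] for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def fetch(url, timeout=10):
    """GET a URL, following redirects, and return (final_url, chain, body_bytes)."""
    response = requests.get(url, allow_redirects=True, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx, such as the 403 in the question.
    response.raise_for_status()
    return response.url, redirect_chain(response), response.content
```

GET follows redirects by default, so allow_redirects=True is spelled out here only for clarity. The chain you see for a given URL depends on the server, so no particular output is promised.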

sebas
Mikko Ohtamaa
  • 10
    @user1048138: Would you mind telling us what you did find to solve your problem? – Peter O. Jan 14 '12 at 00:53
  • That feature just BLEW my mind. Also, it's important to note for other requests (like HEAD), you must set allow_redirects to True for this to work. – halflings Jul 05 '13 at 14:03
  • 1
    While the pointer is correct, this does not immediately address the issue discussed. – cleros Nov 21 '16 at 18:45