Python follow redirects and then download the page?

Question

I have the following Python script, and it works beautifully.

import urllib2

url = 'http://abc.com' # write the url here

usock = urllib2.urlopen(url)
data = usock.read()
usock.close()

print data

However, some of the URLs I give it may redirect two or more times. How can I have Python wait for the redirects to complete before loading the data? For instance, when using the above code with

http://www.google.com/search?hl=en&q=KEYWORD&btnI=1

which is the equivalent of hitting the "I'm Feeling Lucky" button on a Google search, I get:

>>> url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
>>> usick = urllib2.urlopen(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>>

I've tried the (url, data, timeout) arguments; however, I am unsure what to put there.

EDIT: I actually found out that if I don't follow the redirect and just use the headers of the first response, I can grab the location of the next redirect and use that as my final link.

I was not. Just googled it. I can see how to make it NOT follow. However, I can not force it TO follow — Cripto, Jan 11 '12 at 22:34
I know it's been a while, but can you dig deep in the memory vault and tell me how you solved this problem? thanks! — tmthyjames, Feb 19 '15 at 22:19

score 29 · Answer 1 · edited Mar 13 '22 at 19:49

Use requests, as the other answer states; here is an example. The final URL after any redirects will be in r.url. In the examples below, the http URL is redirected to https.

For HEAD:

In [1]: import requests
   ...: r = requests.head('http://github.com', allow_redirects=True)
   ...: r.url

Out[1]: 'https://github.com/'

For GET:

In [1]: import requests
   ...: r = requests.get('http://github.com')
   ...: r.url

Out[1]: 'https://github.com/'

Note that for HEAD you have to specify allow_redirects=True. If you don't, you can still read the redirect target from the headers, but this is not advised.

In [1]: import requests

In [2]: r = requests.head('http://github.com')

In [3]: r.headers.get('location')
Out[3]: 'https://github.com/'
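
One possible reason the header route is fragile (my own reading, not something stated in this answer) is that a Location value may be relative and only describes the next hop, not the final destination. A minimal sketch of resolving one hop by hand, where next_hop is a hypothetical helper name and headers is assumed to be r.headers from requests (which is case-insensitive) or a plain dict keyed by 'location':

```python
from urllib.parse import urljoin


def next_hop(current_url, headers):
    """Return the absolute URL named by a Location header, or None if absent.

    `headers` is assumed to behave like requests' r.headers; with a plain
    dict the key must be spelled exactly 'location'.
    """
    location = headers.get('location')
    if location is None:
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the URL that produced the redirect.
    return urljoin(current_url, location)
```

You would still have to repeat this for every hop and guard against loops yourself, which allow_redirects=True already does for you.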

To download the page you will need GET; you can then access the page body using r.content.
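
Putting the pieces together, here is a rough sketch rather than a definitive implementation. The names download and redirect_chain are hypothetical helpers, the timeout of 10 seconds is an arbitrary choice, and requests is assumed to be installed. r.history is the list of intermediate responses that requests keeps when it follows redirects.

```python
def redirect_chain(response):
    """Return [(status_code, url), ...] for each hop, ending with the final response.

    `response` is assumed to be a requests.Response (or any object exposing
    .history, .status_code and .url the same way).
    """
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def download(url, timeout=10):
    """GET `url`, following redirects, and return (final_url, body_bytes)."""
    import requests  # third-party; imported here so redirect_chain has no dependency

    response = requests.get(url, timeout=timeout)  # GET follows redirects by default
    response.raise_for_status()  # raise on 4xx/5xx instead of returning an error page
    return response.url, response.content
```

Calling download('http://github.com') should hand back the https URL along with the page bytes, and redirect_chain(r) on any response shows which hops were taken to get there.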

edited Mar 13 '22 at 19:49 by Michael Delgado
answered May 30 '18 at 13:47 by Glen Thompson

Why is getting it by the header not advised? – Nightforce2 Oct 16 '18 at 19:41
I know this wasn't that long ago but it feels like it, I think I did a validation and found it to not be as reliable, it might also say that in the docs. If you do a validation let me know what you find. – Glen Thompson Oct 18 '18 at 16:22

score 26 · Accepted Answer · edited Mar 27 '20 at 20:20

You might be better off with the Requests library, which has better APIs for controlling redirect handling:

https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history

Requests:

https://pypi.org/project/requests/ (urllib replacement for humans)
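
As an illustration of that control (a sketch under assumptions, not the only way to do it): a requests.Session has a max_redirects attribute, and requests raises requests.exceptions.TooManyRedirects once the limit is exceeded. The helper name get_with_redirect_limit and the default values below are made up for this example; session is assumed to be a requests.Session.

```python
def get_with_redirect_limit(session, url, max_redirects=5, timeout=10):
    """GET `url` through `session`, following at most `max_redirects` redirects.

    `session` is assumed to be a requests.Session. If the chain is longer
    than the limit, requests raises requests.exceptions.TooManyRedirects.
    """
    session.max_redirects = max_redirects
    response = session.get(url, allow_redirects=True, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response


# Typical use (needs the third-party requests package and network access):
#   import requests
#   with requests.Session() as s:
#       r = get_with_redirect_limit(s, 'http://github.com')
#       print(r.url, len(r.content))
```

Passing allow_redirects=False instead makes requests stop at the first response, which is the "do not follow" behaviour mentioned in the question's comments.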

edited Mar 27 '20 at 20:20 by sebas
answered Jan 11 '12 at 23:42 by Mikko Ohtamaa

@user1048138: Would you mind telling us what you did find to solve your problem? – Peter O. Jan 14 '12 at 00:53
That feature just BLEW my mind. Also, it's important to note for other requests (like HEAD), you must set allow_redirects to True for this to work. – halflings Jul 05 '13 at 14:03
While the pointer is correct, this does not immediately address the issue discussed. – cleros Nov 21 '16 at 18:45
