19

For example, suppose I go to www.yahoo.com/thispage, and Yahoo has set up a filter to redirect /thispage to /thatpage, so that whenever someone goes to /thispage they land on /thatpage.

If I use httplib/requests/urllib, will it know that there was a redirection? What about error pages? Some sites redirect the user to /errorpage whenever the page cannot be found.

martineau
iCodeLikeImDrunk
  • What is the problem you are trying to solve? How is your code not doing the right thing? If you merely want to know about error modes, test this behaviour yourself. – Marcin Nov 20 '12 at 21:52
  • Check http://stackoverflow.com/questions/554446/how-do-i-prevent-pythons-urllib2-from-following-a-redirect – OneOfOne Nov 20 '12 at 21:53
  • @Marcin I have a huge list (1k+) of URLs to test whether they are up or not. I picked 40-50 of them at random to test manually, and I see that some get redirected to an error page whenever a page cannot be found. I also see many URLs being redirected because the URL pattern has changed: same names, just written differently. – iCodeLikeImDrunk Nov 20 '12 at 22:01
  • @OneOfOne that sorta looks like what I need, I'll check it out. Thanks! – iCodeLikeImDrunk Nov 20 '12 at 22:02

4 Answers

29

With requests, any redirects that were followed are recorded in the .history attribute of the response object. It is a Python list of Response objects, ordered from the oldest to the most recent. See the documentation for more.
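As a rough illustration of reading .history, the sketch below collects the status code and URL of every hop, ending with the final response. The helper name redirect_chain is made up for this example, and the stand-in objects only mimic the .history, .status_code and .url attributes of a requests Response so that the snippet runs without network access.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return (status_code, url) for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Stand-in objects shaped like requests.Response (an assumption for this demo).
first_hop = SimpleNamespace(
    status_code=301, url="http://www.yahoo.com/thispage", history=[]
)
final = SimpleNamespace(
    status_code=200, url="http://www.yahoo.com/thatpage", history=[first_hop]
)

for status, url in redirect_chain(final):
    print(status, url)
```

With requests installed, passing the result of requests.get(url) to redirect_chain should work the same way; an empty .history means no HTTP redirect was followed.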

Employee
MikeHunter
19

To prevent requests from following redirects use:

r = requests.get('http://www.yahoo.com/thispage', allow_redirects=False)

If it is indeed a redirect, you can check the redirect target location in r.headers['location'].
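A minimal sketch of that check, assuming the response was fetched with allow_redirects=False. The helper redirect_target and the REDIRECT_CODES set are names invented for this example; the stand-in object replaces a real requests Response so the snippet runs offline.

```python
from types import SimpleNamespace
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(response):
    """Return the absolute redirect target, or None if this is not a redirect."""
    location = response.headers.get("location")
    if response.status_code not in REDIRECT_CODES or not location:
        return None
    # Location may be relative, so resolve it against the requested URL.
    return urljoin(response.url, location)


# Stand-in for requests.get(url, allow_redirects=False). Real requests headers
# are case-insensitive; the plain dict used here is not.
r = SimpleNamespace(
    status_code=302,
    url="http://www.yahoo.com/thispage",
    headers={"location": "/thatpage"},
)
print(redirect_target(r))
```

Resolving the header with urljoin matters because servers sometimes send a relative path in Location rather than a full URL.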

yonilevy
3

The accepted answer is the correct first option. In some cases, though, a site redirects by means other than an HTTP redirect status (a meta tag, for instance), so .history stays empty, but the page you land on specifies a canonical link. In this example I request http://en.wikipedia.org/wiki/Google_Inc_Class_A from Wikipedia, which is a URL that redirects.

>>> import requests
>>> from bs4 import BeautifulSoup
>>> response = requests.get('http://en.wikipedia.org/wiki/Google_Inc_Class_A')

I check the redirect history and find it empty:

>>> response.history
[]

An alternative is to pull the canonical URL, which should hopefully contain what you've been redirected to. (Note that I'm using BeautifulSoup here as well.)

>>> soup = BeautifulSoup(response.content, 'html.parser')
>>> canonical = soup.find('link', {'rel': 'canonical'})
>>> canonical['href']
'http://en.wikipedia.org/wiki/Google'

This does match the URL you get redirected to in this particular case. To be clear, this is an ugly second option, but it is worth trying if all else fails.
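If BeautifulSoup is not available, the same lookup can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not what the answer above uses, and the names CanonicalFinder and find_canonical are invented for the example. The sample HTML is a hand-written stand-in for a downloaded page.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens; compare case-insensitively.
        if "canonical" in (attr.get("rel") or "").lower().split():
            self.canonical = attr.get("href")


def find_canonical(html):
    """Return the canonical URL declared in `html`, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


sample = (
    '<html><head><link rel="canonical" '
    'href="http://en.wikipedia.org/wiki/Google"></head></html>'
)
print(find_canonical(sample))
```

In practice the HTML string would come from the text of the downloaded response. A page without a canonical link yields None, so the caller still needs a fallback.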

dlb8685
  • For future readers: I just checked this example and the history is correctly populated: `requests.get('http://en.wikipedia.org/wiki/Google_Inc_Class_A', allow_redirects=True)`. I don't know if it's due to "allow_redirects" parameters or to a new version of requests package. – Alberto Coletta Jul 20 '16 at 15:41
1

It depends on how they are doing the redirection. The "right" way is to return an HTTP redirect status code (301/302/303, or the newer 307/308). The "wrong" way is to place a refresh meta tag in the HTML.

If they do the former, requests will handle it transparently. Note that any sane error page redirect will still have an error status code (e.g. 404) which you can check as response.status_code.
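For checking a long list of URLs, the two signals mentioned here (the final status code and whether any redirect was followed) could be combined roughly as follows. The function name classify and its three labels are assumptions made for this sketch, and the stand-in objects mimic only the .status_code and .history attributes of a requests Response.

```python
from types import SimpleNamespace


def classify(response):
    """Label a final response as 'error', 'redirected' or 'ok'."""
    if response.status_code >= 400:
        return "error"
    if response.history:
        # At least one HTTP redirect was followed before this response.
        return "redirected"
    return "ok"


# Stand-in responses (hypothetical data, not real results).
hop = SimpleNamespace(status_code=302, history=[])
samples = {
    "moved": SimpleNamespace(status_code=200, history=[hop]),
    "missing": SimpleNamespace(status_code=404, history=[hop]),
    "plain": SimpleNamespace(status_code=200, history=[]),
}

for name, resp in samples.items():
    print(name, classify(resp))
```

An error page that is served with a 200 status would come out as "redirected" or "ok" here, so this only works for sites that set the status code sensibly.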

Katriel