
I have a rather strange question regarding URLs that point to another URL. So, for example, I have a URL:

http://mywebpage/this/is/a/forward

which ultimately points to another url:

http://mynewpage/this/is/new

My question is: when I use, for example, urllib2 in Python to fetch the first page, it ultimately fetches the second page. I would like to know if it's possible to find out what the original link points to. Is there something like a "header" which tells me the second link when I request the first link?

Sorry if this is a really silly question!

AJW

3 Answers


When you issue a GET request for the first URL, the web server will return a 300-series reply code, with a `Location` header whose value is the second URL. You can find out from Python what the second URL was with the `geturl` method of the object returned by `urlopen`. If more than one redirection is involved, urllib only tells you the last hop; there is no built-in way to get the intermediate ones.

This will not handle redirections via JavaScript or meta http-equiv="refresh", but you probably aren't in that situation or you wouldn't have asked the question the way you did.
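The round trip is easy to see end to end. Below is a minimal sketch using Python 3's `urllib.request` (the successor to `urllib2`) against a throwaway local server; the server, port, and the `/old` and `/new` paths are all invented for the demonstration:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class RedirectingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/old':
            # 300-series reply code with a Location header naming the new URL
            self.send_response(302)
            self.send_header('Location', '/new')
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'hello from the new page')

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(('127.0.0.1', 0), RedirectingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_address[1]

resp = urlopen(base + '/old')   # urllib follows the redirect automatically
final_url = resp.geturl()       # the URL the redirect ultimately landed on
body = resp.read()
server.shutdown()
print(final_url)                # ends with /new, not /old
```

Note that `urlopen` never shows you the 302 itself; by the time it returns, the redirect has already been followed and `geturl` reports where you ended up.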

zwol
    +1. Also, it's intentional that `geturl` doesn't give you a way to get the previous redirects. If you really need the whole chain, you almost always want to turn off auto-follow and process the redirects manually (which is pretty easy), at which point you have all the information (full headers, etc.) about each one, not just the URLs. – abarnert Dec 06 '12 at 01:03
  • @Zack, @abarnert: One additional question: when I use geturl, does urllib2 fetch the actual page or not? Sorry, I am a newbie here. Thanks again – AJW Dec 06 '12 at 01:04
  • @JamesW: Are you asking in theory, or in practice? In theory, the page is not guaranteed to be fully fetched until you actually finish reading it (with `u.read()`, `for line in u:`, etc.). If you haven't done any of that, but you have called `u.geturl()`, all that's guaranteed is that it's followed all of the redirects and has gotten at least the headers for the actual page. In practice, of course, it fetches everything as soon as possible, before you do anything with it, but the docs don't guarantee that. Then again, it rarely matters either way, so why do you ask? – abarnert Dec 06 '12 at 02:07
  • @abarnert: thanks so much for your reply. Well, I was simply curious whether geturl() uses fewer resources than, say, u.read(), or whether the overheads are the same — i.e., does it fetch the entire page, or, as you say, just follow the redirects? – AJW Dec 07 '12 at 12:33
  • @JamesW If you want to avoid downloading the entire page, you should probably be using `httplib` instead, and issuing `HEAD` requests. You'll then have to interpret 300-series responses yourself. – zwol Dec 07 '12 at 14:55
  • @JamesW: `urlopen` is designed to be a super-simple way to fetch a URL without thinking about it. And not just a web URL—it can fetch ftp and file URLs, and can be extended to fetch other things in case you've got an old gopher server lying around. So if you want control over what resources it's using, you have to do it in a way that's flexible enough to work for all kinds of URLs. That's pretty complicated. That's why it's much easier to drop down a level and use `httplib` as Zack suggests. – abarnert Dec 07 '12 at 18:45
  • @abarnert, Zack: Thanks both of you for your useful comments. I really appreciate this. – AJW Dec 10 '12 at 11:25
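The manual approach described in the comments above can be sketched by installing a redirect handler that records every hop before delegating to the default behavior. Everything here (the local server and the hypothetical `/a` → `/b` → `/c` chain) is made up for the demonstration:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import HTTPRedirectHandler, build_opener

class ChainRecorder(HTTPRedirectHandler):
    """Keeps every redirect hop instead of only the final URL."""
    def __init__(self):
        self.chain = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.chain.append((code, newurl))   # remember this hop
        return super().redirect_request(req, fp, code, msg, headers, newurl)

class TwoHopHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in ('/a', '/b'):
            self.send_response(302)
            self.send_header('Location', '/b' if self.path == '/a' else '/c')
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'done')

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(('127.0.0.1', 0), TwoHopHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_address[1]

recorder = ChainRecorder()
opener = build_opener(recorder)     # replaces the default redirect handler
resp = opener.open(base + '/a')
final_url = resp.geturl()
resp.read()
server.shutdown()

for code, url in recorder.chain:
    print(code, url)                # every intermediate hop, not just the last
```

Because `build_opener` drops its default `HTTPRedirectHandler` when given a subclass of it, the recorder sees each 3xx response, including the ones `geturl` would have discarded.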

It's most commonly done via a redirection response code (3xx) as defined in RFC 2616, although a "pseudo redirect effect" can be achieved with some JavaScript in the original page.

This SO question is about how to prevent urllib2 from following redirects, it looks like something you might be able to use.
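One way to avoid following the redirect at all is to drop below urllib2 to the HTTP library itself (`httplib` in Python 2, `http.client` in Python 3), which never follows redirects: you read the 3xx status and the `Location` header yourself. A sketch against a made-up local server, using a `HEAD` request so no body is transferred:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class OneHopHandler(BaseHTTPRequestHandler):
    def do_HEAD(self):
        if self.path == '/old':
            self.send_response(302)
            self.send_header('Location', '/new')
        else:
            self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(('127.0.0.1', 0), OneHopHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

conn = http.client.HTTPConnection(host, port)
conn.request('HEAD', '/old')        # HEAD: headers only, no body
resp = conn.getresponse()
status = resp.status                # 302 -- the redirect itself, unfollowed
location = resp.getheader('Location')
conn.close()
server.shutdown()
print(status, location)
```

This gives you the redirect target exactly as the server sent it (here a relative path), at the cost of having to interpret 3xx responses yourself.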

fvu

You can do this using requests:

>>> import requests
>>> url = 'http://ofa.bo/foagK7'
>>> r = requests.head(url)
>>> r.headers['location']
'https://my.barackobama.com/page/s/what-does-2000-mean-to-you'
Brenden Brown
  • Does `requests` let you get the whole chain of redirects? Or the complete headers of the redirect instead of just the redirected URL? If so, you should show that to explain why it's better than `urllib2`. If not, why are you suggesting that the OP change libraries for a feature that his existing library does just as well, or maybe even more easily? – abarnert Dec 06 '12 at 02:29