In Python, how do I check whether two different links actually point to the same page?

Question

For example, these 2 links point to the same location:

http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html

http://www.independent.co.uk/life-style/gadgets-and-tech/news/2292113.html

How do I check this in Python?

What about using the ID 2292113? – garnertb Jun 02 '11 at 21:11

score 12 · Accepted Answer · answered Jun 02 '11 at 21:19

Call geturl() on the result of urllib2.urlopen(). geturl() "returns the URL of the resource retrieved, commonly used to determine if a redirect was followed."

For example:

#!/usr/bin/env python
# coding: utf-8

import urllib2

url1 = 'http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html'
url2 = 'http://www.independent.co.uk/life-style/gadgets-and-tech/news/2292113.html'

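# urlopen() follows redirects; geturl() reports the URL that was finally served.
# (Python 2 syntax: urllib2 and the print statement.)
for url in [url1, url2]:
    result = urllib2.urlopen(url)
    print result.geturl()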

The output is:

http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html
http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html
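For anyone on Python 3, where urllib2 no longer exists, here is a minimal sketch of the same idea using urllib.request. The helper names final_url and same_after_redirects and the resolve parameter are made up for illustration; they are not part of any library. It only detects equivalence that the server exposes through redirects.

```python
import urllib.request


def final_url(url, timeout=10):
    """Return the URL actually served after urlopen() has followed redirects."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # response.url holds the same value that geturl() returns.
        return response.url


def same_after_redirects(url_a, url_b, resolve=final_url):
    """True if both URLs end up at the same final URL.

    `resolve` is a hypothetical hook so the lookup can be swapped out,
    for example with a fake resolver in tests.
    """
    return resolve(url_a) == resolve(url_b)
```

Calling same_after_redirects(url1, url2) with the two URLs from the question would perform two real HTTP requests, so the result depends on how the site behaves at the time.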

Exactly what I was coming here to write. To make it more efficient, use HEAD requests instead of loading the full body of the page: http://stackoverflow.com/questions/107405/how-do-you-send-a-head-http-request-in-python/2070916#2070916 — nearlymonolith, Jun 02 '11 at 21:23
This looks like what I am looking for! Thanks! Ideally I would like to detect page similarities even if the links do not redirect, but this should work for now. — tapan, Jun 02 '11 at 21:24
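A rough Python 3 sketch of the HEAD suggestion above, under a couple of assumptions: as far as I can tell, the stock HTTPRedirectHandler builds the follow-up request without a method, so it falls back to GET, which is why the subclass below puts HEAD back. HeadRedirectHandler and final_url_via_head are names invented for this example. Some servers answer HEAD differently from GET, so treat the result as a hint.

```python
import urllib.request


class HeadRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Keep using HEAD when a HEAD request is redirected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        new_req = super().redirect_request(req, fp, code, msg, headers, newurl)
        if new_req is not None and req.get_method() == "HEAD":
            # Request.get_method() honours the `method` attribute when set.
            new_req.method = "HEAD"
        return new_req


def final_url_via_head(url, timeout=10):
    """Follow redirects with HEAD requests only and return the final URL."""
    opener = urllib.request.build_opener(HeadRedirectHandler)
    request = urllib.request.Request(url, method="HEAD")
    with opener.open(request, timeout=timeout) as response:
        return response.url
```

Comparing final_url_via_head(url1) with final_url_via_head(url2) then works the same way as the geturl() approach, without downloading the page bodies.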

score 2 · Answer 2 · answered Jun 02 '11 at 21:12

It's impossible to discern that merely from the URLs, obviously.

You could fetch the content and compare it, but then I imagine you'd need a smart criterion to decide when two pages are the same. For example, both might show the same article while a randomly chosen advertisement differs, or the list of related articles changes depending on other factors.

Design your program in such a way that the criterion for matching pages is easily replaced, even dynamically, and try until you find one that doesn't fail -- for example, for a newspaper page, you could try finding headlines.
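One way to sketch the "replaceable criterion" design, assuming the two pages have already been downloaded as HTML strings. Every name here (page_title, same_title, exact_match, pages_match) is hypothetical, and the title is only a stand-in for "finding headlines"; a real site may need a different rule.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collect the text found inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def page_title(html):
    """Return the page title with whitespace normalised ('' if none)."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def same_title(html_a, html_b):
    """Criterion: both pages have the same non-empty title."""
    title_a = page_title(html_a)
    return bool(title_a) and title_a == page_title(html_b)


def exact_match(html_a, html_b):
    """Criterion: the raw HTML is byte-for-byte identical."""
    return html_a == html_b


def pages_match(html_a, html_b, criterion=same_title):
    """Apply whichever criterion the caller passes in."""
    return criterion(html_a, html_b)
```

Because the criterion is just a function argument, trying a different rule means passing a different callable, for example pages_match(a, b, criterion=exact_match).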

salezica


Indeed the two URLs above have content that is the same except for a randomised number in a tracking script. Perhaps parsing the page HTML and extracting only the textual content would be a good first attempt. – bobince Jun 02 '11 at 21:17
Is there a way to follow redirects in a URL? For example, if I wget the 2nd link, it goes to the first link. So I am assuming I should be able to get the redirected link without having to actually fetch the page. – tapan Jun 02 '11 at 21:19
If one page redirects to another, then they are the same; but two URLs can serve the same page with no explicit redirection. The server can have its own aliases, in which case the URL won't help you. – salezica Jun 02 '11 at 21:21
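A possible first attempt at the text-only comparison mentioned in the comments, assuming both pages are already available as HTML strings. The names visible_text and same_visible_text are invented for this sketch. It drops the contents of script and style elements, so a randomised number in a tracking script no longer makes the pages differ; rotating ads or related-article lists in the visible markup would still break it.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text with runs of whitespace collapsed to one space."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps text from adjacent elements from running together.
    return " ".join(" ".join(parser.chunks).split())


def same_visible_text(html_a, html_b):
    """True if the two pages have identical text outside scripts and styles."""
    return visible_text(html_a) == visible_text(html_b)
```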
