For example, these 2 links point to the same location:

http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html

http://www.independent.co.uk/life-style/gadgets-and-tech/news/2292113.html

How do I check this in Python?

tapan

2 Answers

Call geturl() on the result of urllib2.urlopen(). geturl() "returns the URL of the resource retrieved, commonly used to determine if a redirect was followed."

For example:

#!/usr/bin/env python
# coding: utf-8

import urllib2

url1 = 'http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html'
url2 = 'http://www.independent.co.uk/life-style/gadgets-and-tech/news/2292113.html'

for url in [url1, url2]:
    result = urllib2.urlopen(url)
    print result.geturl()

The output is:

http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html
http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html
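
On Python 3, urllib2 no longer exists; its urlopen() lives in urllib.request, and the response still offers geturl(). A minimal sketch of the same check, assuming Python 3. The helper names final_url and same_destination are hypothetical, not part of any library, and the resolve parameter exists only so the comparison can be tried without network access:

```python
import urllib.request


def final_url(url, timeout=10):
    """Return the URL actually retrieved after any redirects were followed."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


def same_destination(url_a, url_b, resolve=final_url):
    """Return True if both URLs resolve to the same final URL.

    `resolve` is a hypothetical hook: any callable mapping a URL to its
    final URL. It defaults to a real network lookup via final_url().
    """
    return resolve(url_a) == resolve(url_b)
```

Calling same_destination(url1, url2) with the two links above should return True as long as the site still redirects the short URL to the long one. Two URLs that serve the same page without a redirect would still compare as different.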
Gregg
  • Exactly what I was coming here to write. To make it more efficient, use HEAD requests instead of loading the full body of the page: http://stackoverflow.com/questions/107405/how-do-you-send-a-head-http-request-in-python/2070916#2070916 – nearlymonolith Jun 02 '11 at 21:23
  • This looks like what I am looking for! Thanks! Ideally I would like to look for page similarities even if the links do not redirect, but this should work for now. – tapan Jun 02 '11 at 21:24
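
The HEAD suggestion in the comments could be sketched roughly as follows, assuming Python 3 and only the standard library. The names head_request and resolve_with_head are hypothetical, and following redirects by hand is one possible approach rather than the method from the linked answer. Some servers answer HEAD differently from GET, so treat the result as a hint:

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def head_request(url, timeout=10):
    """Send one HEAD request; return (status, Location header or None)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def resolve_with_head(url, head=head_request, max_hops=10):
    """Follow redirects with HEAD requests only; return the last URL reached."""
    for _ in range(max_hops):
        status, location = head(url)
        if status not in REDIRECT_CODES or not location:
            return url
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")
```

No response body is downloaded at any hop, which is the saving the comment is after. The max_hops limit of 10 is an arbitrary guard against redirect loops.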
It's impossible to tell that from the URLs alone.

You could fetch the content and compare it, but then you'd need a smart criterion to decide when two pages are the same. For example, both URLs might point to the same article while a randomly chosen advertisement differs, or the list of related articles changes depending on other factors.

Design your program in such a way that the criterion for matching pages is easily replaced, even dynamically, and try until you find one that doesn't fail -- for example, for a newspaper page, you could try finding headlines.
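
One way to keep the criterion replaceable is to pass it in as a function. The sketch below assumes Python 3 and the standard library only. All names (visible_text, text_similarity, same_page) are hypothetical, and the 0.9 threshold is an arbitrary starting point, not a recommended value. The default criterion compares visible text and ignores script and style contents:

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the page's text content as a single space-joined string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def text_similarity(html_a, html_b):
    """One possible criterion: similarity ratio (0.0 to 1.0) of visible text."""
    return SequenceMatcher(None, visible_text(html_a), visible_text(html_b)).ratio()


def same_page(html_a, html_b, criterion=text_similarity, threshold=0.9):
    """Apply a replaceable criterion to two HTML documents."""
    return criterion(html_a, html_b) >= threshold
```

Swapping in a different check, such as one that compares only headlines, means passing another function as criterion without touching the rest of the program.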

salezica
  • Indeed the two URLs above have content that is the same except for a randomised number in a tracking script. Perhaps parsing the page HTML and extracting only the textual content would be a good first attempt. – bobince Jun 02 '11 at 21:17
  • Is there a way to follow redirects in a URL? For example, if I wget the 2nd link, it goes to the first link. So I am assuming I should be able to get the redirected link without having to actually get the page. – tapan Jun 02 '11 at 21:19
  • If a page redirects to another, then they are the same, but two pages can be the same page with no explicit redirection. The server can have its own aliases, and in that case the URL won't help you. – salezica Jun 02 '11 at 21:21