
How can I detect that these two URLs lead to the same site (so they are effectively the same URL) without having to scrape and compare the content?

For example: I need to check (with a GET request)

Both URLs point to the same site, but how can I detect that?

I prefer Ruby or Python, but I can use any language.

EDIT:

Another case is http://www.inprovo.com/ & http://www.inprovo.com/default.asp. This site has some random banners that change on every reload, so the HTML is not identical from one reload to the next.

Thank you!

skozz

3 Answers


Python

Use the urlparse module (urllib.parse in Python 3).

>>> from urlparse import urlparse  # Python 3: from urllib.parse import urlparse
>>> urlparse('http://www.n-economia.com/index.asp').netloc
'www.n-economia.com'
>>> urlparse('http://www.n-economia.com/').netloc
'www.n-economia.com'
>>> urlparse('http://www.n-economia.com/index.asp').netloc == urlparse('http://www.n-economia.com/').netloc
True
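
The comparison above looks only at the host. Here is a minimal sketch that also compares the path, assuming Python 3 and assuming that an empty path or a default document such as index.asp or default.asp refers to the directory root. The `DEFAULT_DOCS` set and the `normalize` and `same_page` helpers are hypothetical names, and whether a given server really maps those files to `/` is a guess that only a request can confirm.

```python
from urllib.parse import urlparse

# Hypothetical list of "default document" names; which ones a server
# actually maps to "/" depends on its configuration.
DEFAULT_DOCS = {"index.asp", "default.asp", "index.html", "index.php"}


def normalize(url):
    """Return (host, path) with the path reduced to a comparable form."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()  # hostnames are case-insensitive
    path = parts.path or "/"               # treat an empty path as the root
    head, _, last = path.rpartition("/")
    if last.lower() in DEFAULT_DOCS:
        path = head + "/"                  # drop the default document name
    return host, path


def same_page(url_a, url_b):
    """Guess, from the URL text alone, whether both URLs name the same page."""
    return normalize(url_a) == normalize(url_b)


print(same_page("http://www.n-economia.com",
                "http://www.n-economia.com/index.asp"))
print(same_page("http://www.inprovo.com/",
                "http://www.inprovo.com/sobre_inprovo_quienes_somos.asp"))
```

This never opens a connection, so it cannot see redirects or server-side rewrites; it only encodes a naming convention.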
Dhiraj Thakur
  • Looks good, but what happens in the case of http://www.inprovo.com/ & http://www.inprovo.com/default.asp? This site has some random banners that change when it reloads. Thank you! – skozz Jun 20 '14 at 11:52
  • You asked how to detect whether the URLs have the same destination without scraping the content. This piece of code only analyzes the URL; it doesn't open the URLs, as you wanted. If you want to do something else, could you please be more specific about what you are trying to do? – Dhiraj Thakur Jun 20 '14 at 12:22
  • I know what you mean, but your approach only considers the host. For example (using urlparse), http://www.inprovo.com/sobre_inprovo_quienes_somos.asp & http://www.inprovo.com/ returns `true`, but they are not the same page. It is `true` because it's the same website, not because it's the same page. – skozz Jun 20 '14 at 13:46
  • That's because `netloc` only returns the host part of the URL. If you want to check whether the URLs open the same page, you'll have to compare the `path` attribute. This assumes that the URLs themselves don't redirect to another page; if they do, you'll have to open the URLs to make sure they take you to the same page. – Dhiraj Thakur Jun 20 '14 at 15:42
  • That seems to make sense, but it doesn't work. `urlparse('http://www.n-economia.com').path` returns an empty string, and `urlparse('http://www.n-economia.com/index.asp').path` returns `/index.asp`, so the two cannot match. – skozz Jun 20 '14 at 16:02
  • Sorry man, but to understand it you must use the example I gave you: http://codepad.org/9NUqJHB7 – skozz Jun 20 '14 at 16:46
  • Both outputs in the example are correct. What do you want to check: that the base site is the same, or that the page is the same? – Dhiraj Thakur Jun 20 '14 at 16:54
  • Then my example should work for you. If the path in the URL is the same, then it is the same page. – Dhiraj Thakur Jun 20 '14 at 17:00
  • Hehehe, sorry man, but it is not the same. In the example, comparing the paths of `urlparse('http://www.n-economia.com')` and `urlparse('http://www.n-economia.com/index.asp')` returns `false`, but the page is the same. http://codepad.org/9NUqJHB7 – skozz Jun 20 '14 at 17:02
  • Then you'll need to open the page. If both URLs display the same page but differ as strings, you'd need to include some unique identifier in the page source and then parse it with Python. The code in my example would work if different URLs opened unique pages. I'll update my answer as soon as I reach home. – Dhiraj Thakur Jun 20 '14 at 17:16
  • Thank you so much Dhiraj, I think this topic could be interesting to the community. – skozz Jun 20 '14 at 17:53
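
One identifier that some sites already put in their markup is a `<link rel="canonical">` tag. Below is a minimal sketch of reading it with the standard library, assuming Python 3. Nothing in this thread confirms that the sites discussed here declare such a tag, the HTML strings are made up for illustration, and `CanonicalFinder` and `canonical_url` are hypothetical names.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens
        rel = (attr.get("rel") or "").lower().split()
        if "canonical" in rel:
            self.canonical = attr.get("href")


def canonical_url(html):
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Made-up responses for two URLs that differ only in a rotating banner.
page_a = ('<html><head><link rel="canonical" href="http://example.com/">'
          '</head><body><img src="banner_1.png"></body></html>')
page_b = ('<html><head><link rel="canonical" href="http://example.com/" />'
          '</head><body><img src="banner_2.png"></body></html>')

print(canonical_url(page_a) == canonical_url(page_b))
```

If either page has no canonical tag, `canonical_url` returns None and the comparison says nothing, so a fallback such as comparing content is still needed.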

You can use urllib2 in Python. Its urlopen function returns a response object, and you can get the content of a response with the read() method. If the two responses have the same content, then the URLs serve the same page.

import urllib2  # Python 2; in Python 3 use urllib.request instead

page1 = urllib2.urlopen('http://www.n-economia.com/index.asp')
page2 = urllib2.urlopen('http://www.n-economia.com/')
# Compare the raw response bodies byte for byte
if page1.read() == page2.read():
    print 'same site'
else:
    print 'different'

EDIT: perhaps I misunderstood your post, but I thought it meant you needed to check whether two URLs link to the same page, i.e. they have the same content. If that's not the case, I apologise.
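
An exact comparison fails as soon as anything on the page changes between requests. Below is a minimal sketch of a tolerant comparison using `difflib.SequenceMatcher` from the standard library, assuming Python 3. The `page_similarity` and `probably_same_page` helpers, the 0.9 threshold, and the sample pages are all assumptions made for illustration, not values taken from the sites in the question.

```python
from difflib import SequenceMatcher


def page_similarity(html_a, html_b):
    """Return a similarity ratio between 0.0 and 1.0 for two HTML strings."""
    # autojunk=False stops the heuristic from discarding frequent
    # characters on long inputs, at the cost of speed.
    return SequenceMatcher(None, html_a, html_b, autojunk=False).ratio()


def probably_same_page(html_a, html_b, threshold=0.9):
    """Treat two pages as the same when they are at least `threshold` similar."""
    return page_similarity(html_a, html_b) >= threshold


# Made-up pages that differ only in a rotating banner.
page1 = "<html><body><h1>Inicio</h1><img src='banner_1.png'><p>Noticias</p></body></html>"
page2 = "<html><body><h1>Inicio</h1><img src='banner_2.png'><p>Noticias</p></body></html>"

print(probably_same_page(page1, page2))
```

`SequenceMatcher` can be slow on large documents, so for real pages it may be worth comparing only the extracted text rather than the full markup.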

user3725459
  • Looks good, but what happens in the case of http://www.inprovo.com/ & http://www.inprovo.com/default.asp? This site has some random banners that change when it reloads. Thank you! – skozz Jun 20 '14 at 11:50

I finally solved it using a tf-idf approach inspired by @larsmans's answer:

Quote: Tf-idf (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as

from sklearn.feature_extraction.text import TfidfVectorizer

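# text_files is a list of file paths; with the default input='content',
# the vectorizer expects strings, so read each file's text first
documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T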

or, if the documents are plain strings,

>>> vect = TfidfVectorizer(min_df=1)
>>> tfidf = vect.fit_transform(["I'd like an apple",
...                             "An apple a day keeps the doctor away",
...                             "Never compare an apple to an orange",
...                             "I prefer scikit-learn to Orange"])
>>> (tfidf * tfidf.T).toarray()
array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
       [ 0.25082859,  1.        ,  0.22057609,  0.        ],
       [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
       [ 0.        ,  0.        ,  0.26264139,  1.        ]])
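
To apply the same idea to two web pages, the markup has to be reduced to text first. Below is a minimal standard-library sketch, assuming Python 3. It is a simplification, not the quoted scikit-learn code: it computes cosine similarity over raw term counts with no idf weighting, so its scores will differ from what `TfidfVectorizer` produces. `TextExtractor`, `visible_text`, `cosine_similarity`, and the sample pages are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text with tags, scripts, and styles removed."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)


def cosine_similarity(text_a, text_b):
    """Cosine similarity of raw term-count vectors (no idf weighting)."""
    a = Counter(re.findall(r"\w+", text_a.lower()))
    b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Made-up pages: same content, different rotating banner text.
page1 = ("<html><head><style>p{color:red}</style></head><body>"
         "<h1>Sobre Inprovo</h1><p>Quienes somos</p><div>banner uno</div></body></html>")
page2 = ("<html><head><style>p{color:red}</style></head><body>"
         "<h1>Sobre Inprovo</h1><p>Quienes somos</p><div>banner dos</div></body></html>")

print(cosine_similarity(visible_text(page1), visible_text(page2)))
```

The score still has to be compared against a threshold chosen by experiment; on real pages, where the shared text is much longer than the banner text, the similarity should come out far closer to 1 than in this tiny example.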


skozz