The resulting web page may be the same or different depending on the 'extra parameters', so in general you cannot define rules that detect duplicate content by looking only at the url.
I would suggest treating url1 and url2 as different. Compute an md5sum of each block of, say, 1024 words received from the urls, and maintain a hash map of these md5sums to detect duplicates.
Some web crawling tools may already offer some of the features you need.
Update based on OP's comments: I wrote some code to enhance my answer. There are two versions: the first one is simpler:
def find_matches():
    """
    Basic version: reads urls, but does not consider the semantic information of
    HTML header, body, etc. while computing duplicates.
    """
    from urllib2 import urlopen
    import hashlib

    urls = ['http://www.google.com', 'http://www.google.com/search']

    d = {}              # md5 -> list of urls in which that block was seen
    url_contents = {}   # url -> list of md5s of its blocks
    matches = []

    for url in urls:
        c = urlopen(url)
        url_contents[url] = []
        while 1:
            # read the page in 4 KB chunks and hash each chunk
            r = c.read(4096)
            if not r:
                break
            md5 = hashlib.md5(r).hexdigest()
            url_contents[url].append(md5)
            if md5 in d:
                url2 = d[md5]
                matches.append((md5, url, url2))
            else:
                d[md5] = []
            d[md5].append(url)

    #print url_contents
    print matches

if __name__ == '__main__':
    find_matches()
It was naive to expect the above code to detect duplicates reliably: current web pages are far too complex. Even two urls that look identical to a user differ in many small ways because of ads, hash tags, inclusion of the page's own url, and so on.
The second version is more sophisticated. It introduces a limited semantic analysis of the content based on BeautifulSoup:
def find_matches():
    """
    Some consideration of the HTML header, body, etc. while computing duplicates.
    """
    from urllib2 import urlopen
    import hashlib
    from BeautifulSoup import BeautifulSoup
    import pprint

    urls = ['http://www.google.com', 'http://www.google.com/search']  # assuming all distinct urls

    def txt_md5(txt):
        return hashlib.md5(txt).hexdigest()

    MAX_FILE_SIZE = 1024*1024*1024

    d = {}
    url_contents = {}
    matches = []

    for url in urls:
        try:
            c = urlopen(url)
            url_contents[url] = []
            r = c.read(MAX_FILE_SIZE)
            soup = BeautifulSoup(r)
            header = soup.find('head').text
            body = soup.find('body').text
            # More fine-grained content options like h1, h2, p, etc.,
            # can be included, as can common CSS ids/classes like
            # page, content, etc. (see the sketch after this function).
            for h in [header, body]:
                print h
                md5 = txt_md5(h)
                url_contents[url].append((md5, h))
                if md5 in d:
                    url2 = d[md5]
                    matches.append((md5, url, url2))
                else:
                    d[md5] = []
                d[md5].append(url)
        except Exception as e:
            print "Exception", e
        print '---------------'

    #pprint.pprint(url_contents)
    print matches

if __name__ == '__main__':
    find_matches()
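As the comments in the loop suggest, finer-grained pieces of the page could be hashed as well. A minimal sketch of that extension, following the same BeautifulSoup pattern as above (the tag list is only an example):

def fine_grained_texts(soup):
    """
    Return the text of selected tags as separate strings; each string can then
    be hashed with txt_md5() just like header and body above.
    """
    return [tag.text for tag in soup.findAll(['h1', 'h2', 'p'])]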
However, the second version does not work either, and for the same reason: the head texts of the two urls differed by an embedded hash value, and the body texts differed by the string webhp. I used difflib.context_diff to compute the difference.
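For reference, here is a minimal sketch of how such a diff can be printed with difflib.context_diff (txt1 and txt2 stand for the extracted head or body texts; the names are illustrative):

import difflib

def print_diff(txt1, txt2, label1='url1', label2='url2'):
    """Print a context diff of two extracted texts, line by line."""
    for line in difflib.context_diff(txt1.splitlines(), txt2.splitlines(),
                                     fromfile=label1, tofile=label2,
                                     lineterm=''):
        print line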
It is possible to enhance the code with a third version that parses the web pages more intelligently and compares them more intelligently as well, for example by declaring two texts duplicates when they differ by less than 5% (this ratio can be computed with difflib.SequenceMatcher's ratio() method).
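A minimal sketch of that idea, assuming the head/body texts have already been extracted as in the second version (the 0.95 threshold is only an illustrative choice):

import difflib

def are_near_duplicates(txt1, txt2, threshold=0.95):
    """
    Treat two texts as duplicates when their similarity ratio is at least
    `threshold`, i.e. when they differ by less than about 5%.
    """
    return difflib.SequenceMatcher(None, txt1, txt2).ratio() >= threshold

Such a ratio check could replace the exact md5 comparison above, at the cost of comparing url pairs directly instead of doing a single hash lookup.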