
I am trying to figure out the best way to check whether two or more urls are duplicates when one of them carries extra parameters, as in the code below. In fact, url1 and url2 point to the same page, but when the web spider runs it treats them as two separate urls and the results are duplicated.

from urllib2 import urlopen
import hashlib

url1 = urlopen('http://www.time.com/time/nation/article/0,8599,2109975,00.html?xid=gonewssedit')
u1 = hashlib.md5(url1.read()).hexdigest()
url2 = urlopen('http://www.time.com/time/nation/article/0,8599,2109975,00.html')
u2 = hashlib.md5(url2.read()).hexdigest()
if u1 == u2:
    print 'yes'
else:
    print 'no'

In short, I will generate the md5 hash from the url header, store it in the database, and then, when I crawl a new url, check whether it is a duplicate. But I am not sure this is the best way to do this in Python.
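
Roughly, what I have in mind is something like this (only a rough sketch: the seen_hashes set stands in for the database, and hashing the raw url string is just one possible choice of key):

import hashlib

seen_hashes = set()  # stand-in for the database of stored digests

def is_duplicate(key):
    # key could be the url string, the response headers or the page body,
    # depending on what should count as "the same" url
    digest = hashlib.md5(key).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False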

Thank you very much

mrblue

3 Answers


The content of a web page may be the same or different depending on the 'extra parameters'. So, in general, you cannot define rules that detect duplicate content by looking only at the url.

I would suggest treating url1 and url2 as different. Compute an md5sum of each block of, say, 1024 words received from the urls, and maintain a hash map of these md5sums to be able to detect duplicates.

Some existing web crawling tools may already offer some of the features you need.


Update based on OP's comments: I wrote some code to enhance my answer. There are two versions: the first one is simpler:

def find_matches():
    """
        Basic version: reads urls, but does not consider the semantic information of
        HTML header, body, etc. while computing duplicates.
    """

    from urllib2 import urlopen
    import hashlib

    urls = [ 'http://www.google.com', 'http://www.google.com/search']

    d = {}              # md5 of a content block -> list of urls that produced it
    url_contents = {}   # url -> list of md5 digests of its 4096-byte blocks
    matches = []        # (md5, url, previously seen urls) triples
    for url in urls:
        c = urlopen(url)
        url_contents[url] = []
        while 1:
            r = c.read(4096)
            if not r: break
            md5 = hashlib.md5(r).hexdigest()
            url_contents[url].append(md5)
            if md5 in d:
                url2 = d[md5]
                matches.append((md5, url, url2))
            else:
                d[md5] = []
            d[md5].append(url)
    #print url_contents
    print matches

if __name__ == '__main__':
    find_matches()

It was naive to expect the above code to detect duplicates this way: current web pages are far too complex. Even two urls that look the same to a user typically differ in many places because of ads, hash tags, inclusion of the page's own url, and so on.

The second version is more sophisticated. It introduces a limited semantic analysis of the content based on BeautifulSoup:

def find_matches():
    """
        Some consideration of the HTML header, body, etc. while computing duplicates.
    """

    from urllib2 import urlopen
    import hashlib
    from BeautifulSoup import BeautifulSoup
    import pprint

    urls = [ 'http://www.google.com', 'http://www.google.com/search'] # assuming all distinct urls

    def txt_md5(txt):
        return hashlib.md5(txt).hexdigest()

    MAX_FILE_SIZE = 1024*1024*1024   # read at most 1 GiB per url
    d = {}              # md5 of a text section -> list of urls that produced it
    url_contents = {}   # url -> list of (md5, text) pairs for the head and body
    matches = []        # (md5, url, previously seen urls) triples
    for url in urls:
        try:
            c = urlopen(url)
            url_contents[url] = []
            r = c.read(MAX_FILE_SIZE)
            soup = BeautifulSoup(r)
            header = soup.find('head').text
            body = soup.find('body').text 
            # More fine-grained content options 
            # like h1, h2, p, etc., can be included.
            # Common CSS tags like page, content, etc.
            # can also be included.
            for h in [header, body]:
                print h
                md5 = txt_md5(h)
                url_contents[url].append((md5, h))
                if md5 in d:
                    url2 = d[md5]
                    matches.append((md5, url, url2))
                else:
                    d[md5] = []
                d[md5].append(url)
        except Exception as e:
            print "Exception", e
    print '---------------'
    #pprint.pprint(url_contents)
    print matches

if __name__ == '__main__':
    find_matches()

However, the second version does not work either, and for the same reason. In this case the head texts of the two urls differed by an embedded hash value, and the body texts differed by the string webhp. I used difflib.context_diff to compute the differences.

The code could be extended into a third version that parses the web pages and computes the diff more intelligently, for example by declaring two texts duplicates when they differ by less than 5% (a ratio that can easily be computed with a difflib function).
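
For example, the comparison step of such a third version could look like the sketch below (the looks_like_duplicate name and the 0.95 threshold are only illustrations of the "<5% diff" idea):

import difflib

def looks_like_duplicate(text1, text2, threshold=0.95):
    # SequenceMatcher.ratio() returns a similarity score in [0, 1];
    # anything above the threshold (i.e. less than 5% difference)
    # is treated as a duplicate.
    return difflib.SequenceMatcher(None, text1, text2).ratio() >= threshold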

amit kumar
  • Hi phaedurs, this is wonderful; thank you very much for your answer. Could you give me some more detailed tips about "maintain a hash map of these md5sums to be able to detect duplicates"? Actually, I have learned Scrapy, but it is quite complicated for my current project. – mrblue Mar 24 '12 at 17:50

There is no way to know whether two URIs point to the same resource without retrieving both of them. And even if they are fundamentally the same content, they may have dynamic elements such as ads that change with each request anyway, making it difficult to detect programmatically whether the two URIs are the same.

kindall

Maybe try it like this?

from urlparse import urlparse 

websites = set()

def is_unique(website):
    # Strip off the scheme, query string and fragment;
    # compare only hostname + path
    parsed = urlparse(website)
    url = parsed.hostname + parsed.path
    if url in websites:
        return False
    websites.add(url)
    return True
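
For instance, run against the two urls from the question (assuming the query string is the only part that should be ignored), the first call returns True and the second returns False because the stripped-down url has already been seen:

print is_unique('http://www.time.com/time/nation/article/0,8599,2109975,00.html?xid=gonewssedit')  # True
print is_unique('http://www.time.com/time/nation/article/0,8599,2109975,00.html')                  # False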
Jakob Bowyer