I'm currently writing a program that computes a near-duplicate score within a corpus of text documents (5000+ docs). I'm using Simhash to generate a unique fingerprint for each document (thanks to this GitHub repo).
My data is:
data = {
    1: u'Im testing simhash algorithm.',
    2: u'test of simhash algorithm',
    3: u'This is simhash test.',
}
This gives me 3 hashes like these:
00100110101110100011111000100010010101011001000001110000111001011100110101001101111010100010001011001011000110000100110101100110
00001001110010000000011000001000110010001010000101010000001100000100100011100100110010100000010000000110001001010110000010000100
10001110101100000100101010000010010001011010001000000000101000101100001100100000110011000000011001000000000110000000100110000000
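From what I've read, the standard way to compare two Simhash fingerprints is the Hamming distance (the number of bit positions where they differ), with similarity = 1 - distance / bit_length. Here is a sketch of the pairwise comparison I have in mind, using the three bit strings above (the `hamming` helper is my own, not from the library):

```python
from itertools import combinations

# The three 128-bit fingerprints produced from my documents.
hashes = {
    1: '00100110101110100011111000100010010101011001000001110000111001011100110101001101111010100010001011001011000110000100110101100110',
    2: '00001001110010000000011000001000110010001010000101010000001100000100100011100100110010100000010000000110001001010110000010000100',
    3: '10001110101100000100101010000010010001011010001000000000101000101100001100100000110011000000011001000000000110000000100110000000',
}

def hamming(a, b):
    # Number of bit positions where the two fingerprints differ.
    return sum(x != y for x, y in zip(a, b))

for (i, ha), (j, hb) in combinations(hashes.items(), 2):
    similarity = 1 - hamming(ha, hb) / len(ha)
    print(i, j, round(similarity, 3))
```

Is this the right way to score a pair, or does the block splitting change how the score is computed?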
And now, how do I compare those 3 hashes? I know I have to split them into blocks, but I don't know the exact method.
What I want to do is output all duplicate documents (>70% similarity) with their ID and the IDs of their duplicate docs.
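My current understanding of the block trick (please correct me if this is wrong): if you split each 128-bit fingerprint into k+1 equal blocks, then by the pigeonhole principle any two fingerprints within Hamming distance k must agree exactly on at least one block, so you only need to compare documents that share a block. Here is my attempt at that candidate-generation step (`candidate_pairs` is my own sketch; k=3 is just an illustrative tolerance, I'm not sure how it maps to my 70% threshold):

```python
from collections import defaultdict

def candidate_pairs(hashes, k=3):
    """Return pairs of doc IDs whose fingerprints share at least one
    of the k+1 blocks, i.e. the only pairs that CAN be within
    Hamming distance k of each other."""
    n_blocks = k + 1
    block_len = 128 // n_blocks  # 32 bits per block when k=3
    buckets = defaultdict(set)
    for doc_id, h in hashes.items():
        for b in range(n_blocks):
            chunk = h[b * block_len:(b + 1) * block_len]
            # Key on (block position, block content) so only the
            # same block position can collide.
            buckets[(b, chunk)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in ids:
            for j in ids:
                if i < j:
                    pairs.add((i, j))
    return pairs
```

Candidates would then be verified with the exact Hamming distance. Is this how the blocks are supposed to be used, and how do I pick k for a 70% threshold over 5000 docs?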
Can someone help?