I'm currently writing a program that computes a near-duplicate score within a corpus of text documents (5000+ docs). I'm using Simhash to generate a unique fingerprint for each document (thanks to this GitHub repo).
My data is:
data = {
    1: u'Im testing simhash algorithm.',
    2: u'test of simhash algorithm',
    3: u'This is simhash test.',
}
This gives me 3 hashes like these:
00100110101110100011111000100010010101011001000001110000111001011100110101001101111010100010001011001011000110000100110101100110
00001001110010000000011000001000110010001010000101010000001100000100100011100100110010100000010000000110001001010110000010000100
10001110101100000100101010000010010001011010001000000000101000101100001100100000110011000000011001000000000110000000100110000000
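From what I've read, the standard way to compare two Simhash fingerprints is the Hamming distance (the number of bit positions where they differ), with similarity = 1 - distance / bit_length. Here is a sketch of the pairwise comparison I have in mind, using the three bit strings above (the `hamming` helper is my own, not from the library):

```python
from itertools import combinations

# The three 128-bit fingerprints produced from my documents.
hashes = {
    1: '00100110101110100011111000100010010101011001000001110000111001011100110101001101111010100010001011001011000110000100110101100110',
    2: '00001001110010000000011000001000110010001010000101010000001100000100100011100100110010100000010000000110001001010110000010000100',
    3: '10001110101100000100101010000010010001011010001000000000101000101100001100100000110011000000011001000000000110000000100110000000',
}

def hamming(a, b):
    # Number of bit positions where the two fingerprints differ.
    return sum(x != y for x, y in zip(a, b))

for (i, ha), (j, hb) in combinations(hashes.items(), 2):
    similarity = 1 - hamming(ha, hb) / len(ha)
    print(i, j, round(similarity, 3))
```

Is this the right way to score a pair, or does the block splitting change how the score is computed?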
And now, how do I compare those 3 hashes? I know I have to split them into blocks, but I don't know the exact method.
What I want to do is output all duplicate documents (>70% similarity) with their ID and the IDs of their duplicate docs.
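My current understanding of the block trick (please correct me if this is wrong): if you split each 128-bit fingerprint into k+1 equal blocks, then by the pigeonhole principle any two fingerprints within Hamming distance k must agree exactly on at least one block, so you only need to compare documents that share a block. Here is my attempt at that candidate-generation step (`candidate_pairs` is my own sketch; k=3 is just an illustrative tolerance, I'm not sure how it maps to my 70% threshold):

```python
from collections import defaultdict

def candidate_pairs(hashes, k=3):
    """Return pairs of doc IDs whose fingerprints share at least one
    of the k+1 blocks, i.e. the only pairs that CAN be within
    Hamming distance k of each other."""
    n_blocks = k + 1
    block_len = 128 // n_blocks  # 32 bits per block when k=3
    buckets = defaultdict(set)
    for doc_id, h in hashes.items():
        for b in range(n_blocks):
            chunk = h[b * block_len:(b + 1) * block_len]
            # Key on (block position, block content) so only the
            # same block position can collide.
            buckets[(b, chunk)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in ids:
            for j in ids:
                if i < j:
                    pairs.add((i, j))
    return pairs
```

Candidates would then be verified with the exact Hamming distance. Is this how the blocks are supposed to be used, and how do I pick k for a 70% threshold over 5000 docs?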
Can someone help?