Using Hash functions to remove duplicate content/text

Question

I have a website with a lot of content and I am working on removing duplicates. For this I need to compare two strings and compute their match percentage. I am using the Ruby simhash gem: https://github.com/bookmate/simhash

The gem takes a string and returns an integer hash. I am not sure how to compare the two hashes.

x = 'King Gillette'.simhash(:split_by => //)
y = 'King Camp Gillette'.simhash(:split_by => //)

x  # => 13716569836
y  # => 13809628900

Can I take the difference of the two hashes and convert it to a percentage? Would that indicate how different the strings are?

I don't think that word means what you think it means... 'duplicity' means 'lying' or 'deception'. Perhaps you meant 'duplication'? – MrTheWalrus, Sep 19 '13 at 07:42
Seems like what you are trying to do is pretty easy with this gem: https://github.com/anjlab/rubyfish It gives you a couple of heuristics to compare strings. You would just need to define a threshold yourself. Also check out this slightly related question: http://stackoverflow.com/questions/6395165/most-efficient-way-to-calculate-hamming-distance-in-ruby – Sam Figueroa, Sep 19 '13 at 08:03
I am aware of rubyfish and the heuristics it provides. I was looking for something better (I suppose?). Or do you think something like Hamming distance or White similarity would suffice? The strings are not always just 2-3 words; in half the cases they are paragraphs of text (500-3000 characters). – Varun Jain, Sep 19 '13 at 08:09
@VarunJain please check my post and let me know if there is any problem. Do you want an integer difference between those strings, or just the difference between the strings themselves? – Rajarshi Das, Sep 19 '13 at 08:14
How about using this [fuzzy-string-match gem](https://github.com/kiyoka/fuzzy-string-match)? – tihom, Sep 19 '13 at 08:42
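
On the question of how to compare the two integers: subtracting them is generally not meaningful. Simhash values are usually compared by counting the bit positions in which they differ (the Hamming distance mentioned in the comments). Below is a minimal sketch of that idea. The helper names `hamming_distance` and `simhash_similarity` are hypothetical and not part of the simhash gem, and the 64-bit hash width is an assumption that should be set to whatever width the hashes were actually generated with.

```ruby
# Count the bit positions in which two non-negative integer hashes differ.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count('1')
end

# Turn the distance into a rough percentage similarity.
# `bits` is the hash width; 64 is an assumption, not a confirmed gem default.
def simhash_similarity(a, b, bits = 64)
  100.0 * (bits - hamming_distance(a, b)) / bits
end

x = 13716569836 # values quoted in the question
y = 13809628900

puts hamming_distance(x, y)
puts simhash_similarity(x, y).round(1)
```

A threshold on the distance (for example, treating hashes within a few bits of each other as near-duplicates) would still have to be chosen by experiment on the actual content.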

Rajarshi Das · Answer 1 · 2013-09-19T08:07:54.033

If you want to remove the duplicated words, or find the difference between the two strings, and if I understand you correctly, you can simply do this:

>> a1 = 'King Gillette'.split(" ")
=> ["King", "Gillette"]
>> a2 = 'King Camp Gillette'.split(" ")
=> ["King", "Camp", "Gillette"]
>> a2 - a1
=> ["Camp"]

edited Sep 19 '13 at 08:07

answered Sep 19 '13 at 08:01


Hi Rajarshi, I want a percentage similarity between two strings. They can be long as well, and would contain words other than proper nouns, such as prepositions and connectors. I am not sure subtracting would hold up in such a case. – Varun Jain Sep 19 '13 at 08:30
Subtracting gives you exactly the strings in the first array that have no match in the second, since it is a simple array difference. – Rajarshi Das Sep 19 '13 at 08:32
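
One way to turn the word-level comparison discussed above into a percentage is the Jaccard similarity: the number of shared words divided by the number of distinct words across both strings. This is a different technique from the plain array difference in the answer, shown only as a sketch. The helper names `word_set` and `jaccard_percent` are hypothetical, and splitting on `\w+` after lowercasing is an assumption about what counts as a word.

```ruby
require 'set'

# Split text into a set of lowercase words.
def word_set(text)
  text.downcase.scan(/\w+/).to_set
end

# Jaccard similarity as a percentage: shared words / all distinct words.
def jaccard_percent(a, b)
  sa = word_set(a)
  sb = word_set(b)
  union = sa | sb
  return 100.0 if union.empty? # two empty strings count as identical
  100.0 * (sa & sb).size / union.size
end

puts jaccard_percent('King Gillette', 'King Camp Gillette').round(1) # prints 66.7
```

Because it works on sets, this ignores word order and repeated words, which may or may not be acceptable for paragraphs of 500-3000 characters.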
