I have a website with a lot of content and I am working on removing duplicates. For this I need to compare two strings and get their match percentage. I am using the Ruby simhash gem: https://github.com/bookmate/simhash

The gem takes a string and returns an integer hash. I am not sure how to compare the two hashes.

x = 'King Gillette'.simhash(:split_by => //)
# => 13716569836

y = 'King Camp Gillette'.simhash(:split_by => //)
# => 13809628900

Can I take the numeric difference of the two hashes and turn it into a percentage? Does that indicate how different the strings are?
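For context, simhash values are normally compared bit by bit (Hamming distance) rather than by subtracting them, because similar inputs are meant to produce hashes that differ in only a few bit positions. The following is a minimal sketch of that comparison. The helper names `hamming_distance` and `bit_similarity` are hypothetical and not part of the simhash gem, and the 64-bit width is an assumption that should be matched to whatever hash size the gem is configured to produce.

```ruby
# Hypothetical helpers, not part of the simhash gem.

# Number of bit positions in which two non-negative integer hashes differ.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count('1')
end

# Share of matching bits as a percentage.
# `bits` is an assumed hash width; adjust it to the real hash size.
def bit_similarity(a, b, bits = 64)
  (1.0 - hamming_distance(a, b).to_f / bits) * 100
end

hash_a = 13716569836 # simhash reported for 'King Gillette'
hash_b = 13809628900 # simhash reported for 'King Camp Gillette'

puts hamming_distance(hash_a, hash_b)
puts bit_similarity(hash_a, hash_b).round(1)
```

A small Hamming distance suggests near-duplicate text. The cutoff is something to tune against real data, not a fixed rule.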

Varun Jain
  • I don't think that word means what you think it means... 'duplicity' means 'lying' or 'deception'. Perhaps you meant 'duplication'? – MrTheWalrus Sep 19 '13 at 07:42
  • Did the edit, thanks for pointing it out! – Varun Jain Sep 19 '13 at 07:50
  • Seems like what you are trying to do is pretty easy with this gem; it gives you a couple of heuristics to compare strings. You would just need to define a threshold yourself: https://github.com/anjlab/rubyfish Also check out this slightly related question: http://stackoverflow.com/questions/6395165/most-efficient-way-to-calculate-hamming-distance-in-ruby – Sam Figueroa Sep 19 '13 at 08:03
  • I am aware of rubyfish and the heuristics it provides. I was looking for something better (I suppose?), or do you think something like Hamming distance or White Similarity would suffice? The strings are not always just 2-3 words; in half the cases they would be text paragraphs (500-3000 characters) – Varun Jain Sep 19 '13 at 08:09
  • @VarunJain Please check my post and let me know if there is any problem. Do you want an integer difference between those strings, or just the difference between the strings themselves? – Rajarshi Das Sep 19 '13 at 08:14
  • How about using this [fuzzy-string-match gem](https://github.com/kiyoka/fuzzy-string-match) – tihom Sep 19 '13 at 08:42

1 Answer

If I understand correctly, you want either to remove the duplicated words or to get the difference between the strings. In that case you can simply do this:

>> a1 = 'King Gillette'.split(" ")
=> ["King", "Gillette"]
>> a2 = 'King Camp Gillette'.split(" ")
=> ["King", "Camp", "Gillette"]
>> a2 - a1
=> ["Camp"]
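
If a percentage is needed rather than the list of unmatched words, the same word arrays can be compared as sets. This is only a sketch using Jaccard similarity (shared words divided by all distinct words), which is a swapped-in technique and not something the simhash gem provides. The helper name `word_similarity` is hypothetical, and lowercasing plus whitespace splitting are assumptions about how the text should be normalized.

```ruby
require 'set'

# Hypothetical helper: Jaccard similarity of the two word sets, as a percentage.
def word_similarity(s1, s2)
  w1 = s1.downcase.split.to_set
  w2 = s2.downcase.split.to_set
  return 100.0 if w1.empty? && w2.empty? # two empty strings count as identical
  (w1 & w2).size.to_f / (w1 | w2).size * 100
end

# 2 shared words out of 3 distinct words
puts word_similarity('King Gillette', 'King Camp Gillette').round(1) # prints 66.7
```

Word order and repeated words are ignored here, so for paragraphs of 500-3000 characters this is a rough signal at best.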
Rajarshi Das
  • Hi Rajarshi, I want a percentage similarity between two strings. They can be long as well and would contain text other than proper nouns, such as prepositions and connectors. I am not sure if subtracting would hold in such a case – Varun Jain Sep 19 '13 at 08:30
  • Subtracting gives you exactly the strings from the first array that have no match in the second, as it is a simple array difference – Rajarshi Das Sep 19 '13 at 08:32