17

I'd like to compare two strings in Ruby and find their similarity.

I've had a look at the Levenshtein gem, but it seems it was last updated in 2008 and I can't find documentation on how to use it. Some blogs also suggest it's broken.

I tried the text gem with Levenshtein, but it returns an integer distance (smaller is better).

Obviously, if the two strings differ in length, I run into problems with the Levenshtein algorithm (say, comparing two names where one has a middle name and the other doesn't).

What would you suggest I do to get a percentage comparison?

Edit: I'm looking for something similar to PHP's similar_text.

Tarang
  • Possible duplicate of http://stackoverflow.com/questions/4761793/how-to-do-advanced-string-comparison-in-ruby – Fredrik Pihl Mar 22 '12 at 12:17
  • This generates a list of differences; I'm looking for a % similarity – Tarang Mar 22 '12 at 12:19
  • If the strings are of different length, which one should be taken as the base for calculating the percentage? – Michael Kohl Mar 22 '12 at 12:19
  • The longer one would be better? I'm trying to go through a list of names and match each one in one column to the most similar one in another column (the ones on one side have middle names or dashes) – Tarang Mar 22 '12 at 12:21
  • It's possible to use a Levenshtein comparison and convert it to a percentage, as suggested here: http://stackoverflow.com/questions/10405440/percentage-rank-of-matches-using-levenshtein-distance-matching – nazar kuliyev Jul 03 '16 at 06:38
  • Good answers here: https://stackoverflow.com/questions/16323571/measure-the-distance-between-two-strings-with-ruby – Paulo Belo Jan 31 '23 at 11:46

3 Answers

19

I think your question could do with some clarifications, but here's something quick and dirty (calculating as percentage of the longer string as per your clarification above):

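def string_difference_percent(a, b)
  longer = [a.size, b.size].max
  return 0.0 if longer.zero? # two empty strings are identical
  # Count the positions where both strings have the same character.
  same = a.each_char.zip(b.each_char).count { |c, d| c == d }
  # Fraction of the longer string that differs: 0.0 = identical, 1.0 = nothing matches.
  (longer - same) / longer.to_f
end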

I'm still not sure how much sense this percent difference you are looking for makes, but this should get you started at least.

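It's a bit like Levenshtein distance in that it compares the strings character by character, but it only compares characters at the same position. So if two names differ only by a middle name, everything after the insertion is shifted and they'll actually come out as very different.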
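If the positional comparison is too strict for names with middle names, one option (also suggested in the comments on the question) is to normalize a real Levenshtein distance by the length of the longer string. The following is only a minimal sketch under that assumption: `levenshtein` is a plain textbook implementation written here for illustration, and `levenshtein_similarity_percent` is a hypothetical helper name, not part of any gem.

```ruby
# Textbook dynamic-programming Levenshtein distance
# (insertions, deletions, and substitutions each cost 1).
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost  # substitution (or match)
      ].min
    end
    previous = current
  end
  previous.last
end

# Hypothetical helper: similarity as a percentage of the longer string.
# Treating two empty strings as 100% similar is an assumption.
def levenshtein_similarity_percent(a, b)
  longer = [a.length, b.length].max
  return 100.0 if longer.zero?

  (1 - levenshtein(a, b).to_f / longer) * 100
end

puts levenshtein_similarity_percent("John Smith", "John A Smith").round(1)
```

With this normalization an inserted middle initial only costs the inserted characters, rather than shifting every later position out of alignment.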

Michael Kohl
  • Can someone explain the 'same' bit? So it loops over each character, while the zip create an array for each character in string A with - what I would expect would be - every character in string B. How does the second each_char know what index to concatenate into the array? – Jack Rothrock Mar 14 '17 at 20:26
  • Also, this calculation doesn't work well when there is one character changed at the beginning. – Jack Rothrock Mar 14 '17 at 20:27
  • 1
    Beware of the **a** in the Select, because it clears the variable passed by parameter. It is better to use other letters. `same = a.each_char.zip(b.each_char).select{ |c,d| c == d }.size` – sesperanto Apr 26 '17 at 12:03
  • 1
    It just shadows it inside the block. – Michael Kohl Apr 26 '17 at 13:49
  • 1
    `same = a.each_char.zip(b.each_char).count{ |c,d| c == d }` – Navid EMAD Apr 11 '22 at 18:15
  • Will return a value between 0 and 1: 0 means the strings are identical, 1 means they are completely different – Raoni Sporteman Jan 04 '23 at 15:11
17

There is now a Ruby gem for similar_text (https://rubygems.org/gems/similar_text). It provides a `similar` method that compares two strings and returns a number representing the percent similarity between them.
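For a sense of what a similar_text-style percentage measures, here is a rough pure-Ruby sketch of the general idea as I understand it: find the longest common substring, recurse on the pieces to its left and right, and report the matched characters as a share of the combined length. This is not the gem's actual code; `longest_common_substring`, `common_chars`, and `similar_text_percent` are hypothetical names, and returning 0.0 for two empty strings is an assumption.

```ruby
# Find the longest common substring of a and b.
# Returns [start_in_a, start_in_b, length]; length is 0 if nothing matches.
def longest_common_substring(a, b)
  best = [0, 0, 0]
  a.length.times do |i|
    b.length.times do |j|
      len = 0
      len += 1 while i + len < a.length && j + len < b.length && a[i + len] == b[j + len]
      best = [i, j, len] if len > best[2]
    end
  end
  best
end

# Count matching characters: the longest common substring, plus whatever
# matches recursively to the left and to the right of it.
def common_chars(a, b)
  i, j, len = longest_common_substring(a, b)
  return 0 if len.zero?

  len +
    common_chars(a[0, i], b[0, j]) +
    common_chars(a[(i + len)..-1], b[(j + len)..-1])
end

# Matched characters as a percentage of the combined length of both strings.
def similar_text_percent(a, b)
  total = a.length + b.length
  return 0.0 if total.zero?

  common_chars(a, b) * 2.0 * 100 / total
end

puts similar_text_percent("World", "Word").round(2)
```

The brute-force substring search is cubic in the worst case, which is fine for short names but worth replacing for long inputs.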

Zhihao
user2837093
15

I can recommend the fuzzy-string-match gem.

You can use it like this (taken from the docs):

require "fuzzystringmatch"
jarow = FuzzyStringMatch::JaroWinkler.create(:native)
p jarow.getDistance("jones", "johnson")

It will return a score of about 0.832, which tells you how well the two strings match (scores range from 0 to 1, with 1 being an exact match).
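To show where a number like 0.832 comes from, here is a minimal pure-Ruby sketch of the textbook Jaro-Winkler calculation. It is not the gem's implementation; the method names `jaro` and `jaro_winkler`, the 0.1 scaling factor, and the 4-character prefix cap are the conventional defaults, assumed here for illustration.

```ruby
# Jaro similarity: based on the number of matching characters that are
# close to each other, and how many of those are out of order.
def jaro(a, b)
  return 1.0 if a == b
  return 0.0 if a.empty? || b.empty?

  window = [[a.length, b.length].max / 2 - 1, 0].max
  a_matched = Array.new(a.length, false)
  b_matched = Array.new(b.length, false)

  # Pair up equal characters that are at most `window` positions apart.
  a.each_char.with_index do |char, i|
    lo = [i - window, 0].max
    hi = [i + window, b.length - 1].min
    (lo..hi).each do |j|
      next if b_matched[j] || b[j] != char

      a_matched[i] = b_matched[j] = true
      break
    end
  end

  matches = a_matched.count(true)
  return 0.0 if matches.zero?

  # Matched characters that line up in a different order are transpositions.
  a_seq = a.chars.select.with_index { |_, i| a_matched[i] }
  b_seq = b.chars.select.with_index { |_, j| b_matched[j] }
  transpositions = a_seq.zip(b_seq).count { |x, y| x != y } / 2

  m = matches.to_f
  (m / a.length + m / b.length + (m - transpositions) / m) / 3
end

# Winkler's adjustment: boost the score for a shared prefix (up to 4 chars).
def jaro_winkler(a, b, scaling = 0.1)
  score = jaro(a, b)
  prefix = a.chars.zip(b.chars).take(4).take_while { |x, y| x == y }.length
  score + prefix * scaling * (1 - score)
end

puts jaro_winkler("jones", "johnson").round(3)
```

Because the prefix bonus rewards strings that start the same way, this measure tends to suit names where the first name matches but the rest varies.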

czerasz