Ruby compare two strings similarity percentage

Question

I'd like to compare two strings in Ruby and find their similarity.

I've had a look at the Levenshtein gem, but it seems it was last updated in 2008 and I can't find documentation on how to use it. Some blogs also suggest it's broken.

I tried the text gem with Levenshtein, but it returns an integer distance (smaller is better) rather than a percentage.

Obviously, if the two strings are of different lengths, I run into problems with the Levenshtein algorithm (say, comparing two names where one has a middle name and the other doesn't).

What would you suggest I do to get a percentage comparison?

Edit: I'm looking for something similar to PHP's similar_text.

Possible duplicate of http://stackoverflow.com/questions/4761793/how-to-do-advanced-string-comparison-in-ruby — Fredrik Pihl, Mar 22 '12 at 12:17
That generates a list of differences; I'm looking for a % similarity. — Tarang, Mar 22 '12 at 12:19
If the strings are of different length, which one should be taken as the base for calculating the percentage? — Michael Kohl, Mar 22 '12 at 12:19
The longer one would be better? I'm trying to go through a list of names and match each one in one column to the most similar one in another column (the names on one side have middle names or dashes). — Tarang, Mar 22 '12 at 12:21
It is possible to use a Levenshtein comparison and convert it to a percentage, as suggested here: http://stackoverflow.com/questions/10405440/percentage-rank-of-matches-using-levenshtein-distance-matching — nazar kuliyev, Jul 03 '16 at 06:38
Good answers here: https://stackoverflow.com/questions/16323571/measure-the-distance-between-two-strings-with-ruby — Paulo Belo, Jan 31 '23 at 11:46

Michael Kohl · Accepted Answer · 2022-04-12T02:45:46.190

19

I think your question could do with some clarifications, but here's something quick and dirty (calculating as percentage of the longer string as per your clarification above):

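def string_difference_percent(a, b)
  longer = [a.size, b.size].max
  return 0.0 if longer.zero? # two empty strings do not differ
  # Count positions where both strings hold the same character.
  # The block parameters shadow the outer a and b only inside the block.
  same = a.each_char.zip(b.each_char).count { |a, b| a == b }
  # Fraction (0.0..1.0) of the longer string that differs.
  (longer - same) / longer.to_f
end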

I'm still not sure how much sense this percent difference you are looking for makes, but this should get you started at least.

It's a bit like Levenshtein distance, in that it compares the strings character by character. So if two names differ only by the middle name, they'll actually be very different.
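For the middle-name case, an edit distance that allows insertions tends to behave better than a position-by-position comparison. Below is a minimal sketch of that idea, normalising a plain Levenshtein distance by the length of the longer string. The method names `levenshtein` and `similarity_percent` are made up for this illustration and do not come from any gem.

```ruby
# Classic dynamic-programming Levenshtein distance
# (insertions, deletions and substitutions each cost 1).
def levenshtein(a, b)
  # prev[j] is the distance between the already processed prefix of a and b[0, j].
  prev = (0..b.size).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: similarity as a percentage of the longer string.
def similarity_percent(a, b)
  longer = [a.size, b.size].max
  return 100.0 if longer.zero? # two empty strings are treated as identical
  (1 - levenshtein(a, b) / longer.to_f) * 100
end

puts similarity_percent("John Smith", "John Paul Smith").round(1)
```

Here the inserted middle name costs five edits out of fifteen characters, so the pair scores roughly 67% instead of being penalised for every shifted character.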

edited Apr 12 '22 at 02:45

answered Mar 22 '12 at 12:25

Michael Kohl


Can someone explain the 'same' bit? So it loops over each character, while the zip creates an array for each character in string A with (what I would expect to be) every character in string B. How does the second each_char know what index to concatenate into the array? – Jack Rothrock Mar 14 '17 at 20:26
Also, this calculation doesn't work well when there is one character changed at the beginning. – Jack Rothrock Mar 14 '17 at 20:27
1

Beware of the **a** in the Select, because it clears the variable passed by parameter. It is better to use other letters. `same = a.each_char.zip(b.each_char).select{ |c,d| c == d }.size` – sesperanto Apr 26 '17 at 12:03
1

It just shadows it inside the block. – Michael Kohl Apr 26 '17 at 13:49
1

`same = a.each_char.zip(b.each_char).count{ |c,d| c == d }` – Navid EMAD Apr 11 '22 at 18:15
Will return a value between 0 and 1: 0 means the strings are identical, 1 means they are completely different. – Raoni Sporteman Jan 04 '23 at 15:11
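
To illustrate the two points raised in these comments (how `zip` pairs up the characters, and what shadowing does to the outer variables), here is a small sketch. The sample strings are arbitrary and only serve the demonstration.

```ruby
a = "abc"
b = "abxy"

# zip pairs elements by position; the receiver decides how many pairs there are.
pairs = a.each_char.zip(b.each_char)
p pairs # => [["a", "a"], ["b", "b"], ["c", "x"]]

# When the receiver is the longer one, the missing positions are padded with nil.
p b.each_char.zip(a.each_char).last # => ["y", nil]

# Block parameters named a and b shadow the outer variables only inside the block.
same = pairs.count { |a, b| a == b }
p same # => 2
p a    # => "abc" (the outer variable is untouched)
```

Because the pairing is purely positional, a single character inserted at the start shifts every later pair, which is why this approach does badly on such inputs.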

score 17 · Answer 2 · edited Nov 07 '13 at 21:29

17

There is now a Ruby gem for similar_text (https://rubygems.org/gems/similar_text). It provides a `similar` method that compares two strings and returns a number representing the percent similarity between them.
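
If installing a gem is not an option, the idea behind PHP's similar_text can be sketched in plain Ruby: find the longest common substring, then recurse on the pieces to its left and right, and scale the number of shared characters by the combined length. This is an independent sketch, not the gem's code, and the method names (`longest_common_substring`, `similar_chars`, `similar_text_percent`) are invented for this example.

```ruby
# Returns [length, start_in_a, start_in_b] of the first longest common substring.
def longest_common_substring(a, b)
  best_len = best_i = best_j = 0
  a.size.times do |i|
    b.size.times do |j|
      len = 0
      len += 1 while i + len < a.size && j + len < b.size && a[i + len] == b[j + len]
      best_len, best_i, best_j = len, i, j if len > best_len
    end
  end
  [best_len, best_i, best_j]
end

# Shared characters: the longest common substring plus, recursively,
# whatever is shared to its left and to its right.
def similar_chars(a, b)
  len, i, j = longest_common_substring(a, b)
  return 0 if len.zero?
  len + similar_chars(a[0, i], b[0, j]) +
    similar_chars(a[(i + len)..-1], b[(j + len)..-1])
end

# Percentage in the range 0.0..100.0, relative to the combined length.
def similar_text_percent(a, b)
  total = a.size + b.size
  return 0.0 if total.zero?
  similar_chars(a, b) * 2.0 * 100 / total
end

puts similar_text_percent("World", "Word").round(2)
```

This naive version is cubic in the worst case, so it is only suitable for short strings such as names.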

edited Nov 07 '13 at 21:29

Zhihao


answered Nov 07 '13 at 21:10

user2837093


3

similar_text gem freezes on big strings, tried 143kb html page – nazar kuliyev Jul 03 '16 at 06:38

czerasz · Answer 3 · 2017-03-09T23:12:59.477

15

I can recommend the fuzzy-string-match gem.

You can use it like this (taken from the docs):

require "fuzzystringmatch"
jarow = FuzzyStringMatch::JaroWinkler.create(:native)
p jarow.getDistance("jones", "johnson")

It will return a score of about 0.832, which indicates how well the two strings match (1.0 is an exact match).
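
To see where that number comes from, here is a rough pure-Ruby sketch of the Jaro-Winkler measure. It is not the gem's implementation: the method names `jaro` and `jaro_winkler` are made up, the scaling factor of 0.1 and the four-character prefix limit are the commonly quoted defaults, and the boost is applied unconditionally, which may differ from what a given library does.

```ruby
# Jaro similarity: counts characters that match within a sliding window
# and penalises matched characters that appear in a different order.
def jaro(s1, s2)
  return 1.0 if s1 == s2
  return 0.0 if s1.empty? || s2.empty?

  window = [[s1.size, s2.size].max / 2 - 1, 0].max
  matched1 = Array.new(s1.size, false)
  matched2 = Array.new(s2.size, false)
  matches = 0

  s1.each_char.with_index do |c, i|
    lo = [i - window, 0].max
    hi = [i + window, s2.size - 1].min
    (lo..hi).each do |j|
      next if matched2[j] || s2[j] != c
      matched1[i] = matched2[j] = true
      matches += 1
      break
    end
  end
  return 0.0 if matches.zero?

  # Compare matched characters in order; each out-of-order pair counts half.
  m1 = s1.chars.select.with_index { |_, i| matched1[i] }
  m2 = s2.chars.select.with_index { |_, j| matched2[j] }
  transpositions = m1.zip(m2).count { |x, y| x != y } / 2

  m = matches.to_f
  (m / s1.size + m / s2.size + (m - transpositions) / m) / 3
end

# Winkler's adjustment: boost the score for a shared prefix of up to 4 characters.
def jaro_winkler(s1, s2, scaling = 0.1)
  j = jaro(s1, s2)
  prefix = s1.chars.zip(s2.chars).take(4).take_while { |x, y| x == y }.size
  j + prefix * scaling * (1 - j)
end

puts jaro_winkler("jones", "johnson").round(3)
```

For "jones" and "johnson" this finds four matching characters, no transpositions and a shared prefix of "jo", which works out to roughly 0.832.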

edited Mar 09 '17 at 23:12

answered Sep 22 '15 at 11:18

czerasz

