
I am used to doing exact match checks on a lot of strings within Ruby, but I am wondering if there's a way to make this process more efficient.

For example, I am taking data from one source and comparing it to what's in ActiveRecord. If www.domain.com is in one location but domain.com is in the other, the only way I can tell the two match is by removing the www in one place or adding it in the other.
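
Right now my workaround is to normalize one side before the exact match, roughly like this (a simplified sketch; the strip_www helper name is just for illustration):

def strip_www(host)
  # Drop a leading "www." so both sides compare on the bare domain
  host.sub(/\Awww\./, '')
end

strip_www('www.domain.com') == strip_www('domain.com') # => true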

Is there a way to smartly determine whether two pieces of data are alike?

In the above example, 10 out of 14 characters (about 71.43%) are alike, so I think it'd be safe to assume that the two records should be linked, since they are only slightly different.
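
That figure is just the number of shared characters divided by the length of the longer string:

(10.0 / 'www.domain.com'.length * 100).round(2) # => 71.43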

Is there a gem or a way to be able to intelligently make this kind of decision?

LewlSauce
  • `dowmawiwn.com`? Normally you normalize things into a simpler form that's consistent. – tadman Feb 06 '21 at 00:23
  • Yeah that's typically how we are doing it, but for this one particular project, I'm having to find similar data and make them consistent so that it's like that going forward. – LewlSauce Feb 06 '21 at 00:26
  • Possible duplicate of [Ruby compare two strings similarity percentage](https://stackoverflow.com/questions/9822078/ruby-compare-two-strings-similarity-percentage); it links a few similar questions, but this one is the most recent. – eux Feb 06 '21 at 01:22

1 Answer


Fuzzy Matching with Damerau-Levenshtein Distance

Any kind of fuzzy matching will depend somewhat on how you choose to look at your data. For something like this, you can use one of the many variants of the Levenshtein distance, such as Damerau-Levenshtein. In the class below, you can adjust MIN_SIMILARITY_PERCENT to tune the threshold; the similarity index itself is calculated from the edit distance as a percentage of the length of the longer word in the pair.

require 'damerau-levenshtein'

class SimilarityIndex
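  # Two words count as similar when their similarity index meets this threshold.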
  MIN_SIMILARITY_PERCENT = 70.0

  attr_reader :similarity_idx, :words

  def initialize word1, word2
    @words = word1, word2
    similar?
  end
  
  def edit_distance
    DamerauLevenshtein.distance *@words
  end

  def longest_word_length
    @words.max_by(&:length).size
  end

  def similar?
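    # Convert the edit distance into a similarity percentage of the longer
    # word's length and compare it against MIN_SIMILARITY_PERCENT.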
    e = edit_distance
    l = longest_word_length.to_f
    @similarity_idx = ((1 - (e/l)) * 100).round 2
    @similarity_idx >= MIN_SIMILARITY_PERCENT
  end
end

You can validate this with some test data. For example:

word_pairs = %w[
  www.domain.com
  domain.com

  www.example.com
  foobarbaz.example.com
]

word_pairs.each_slice(2).map do |word1, word2|
  s = SimilarityIndex.new word1, word2
  { words: s.words, similarity_idx: s.similarity_idx, similar?: s.similar? }
end

This test data generates the following results:

[{:words=>["www.domain.com", "domain.com"],
  :similarity_idx=>71.43,
  :similar?=>true},
 {:words=>["www.example.com", "foobarbaz.example.com"],
  :similarity_idx=>57.14,
  :similar?=>false}]
Todd A. Jacobs