Take a look at the Amatch gem, found here. It implements several string-distance algorithms. Also read this other answer on the difference between the Levenshtein and Jaro distance algorithms, and check which one suits you better.
TL;DR: here is a small snippet, using the Amatch gem, to help you get started on your problem. The scores are rounded to two decimals. Note that levenshtein_similar is case-sensitive here, while the Jaro variants give both spellings the same score.
require 'amatch'
# Assumption: the similarity methods are mixed into String like this.
class String
  include Amatch::StringMethods
end
'subscription'.levenshtein_similar('SUBSCR|P||ON') #=> 0.0
'SUBSCRIPTION'.levenshtein_similar('SUBSCR|P||ON') #=> 0.75
'subscription'.jaro_similar('SUBSCR|P||ON') #=> 0.83
'SUBSCRIPTION'.jaro_similar('SUBSCR|P||ON') #=> 0.83
'subscription'.jarowinkler_similar('SUBSCR|P||ON') #=> 0.9
'SUBSCRIPTION'.jarowinkler_similar('SUBSCR|P||ON') #=> 0.9
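To see where the two Levenshtein figures come from, here is a minimal pure-Ruby sketch of a normalized Levenshtein similarity. It is only an illustration, not Amatch's own implementation. The method names `levenshtein_distance` and `levenshtein_similarity` are hypothetical, and the normalization (1 minus the edit distance divided by the length of the longer string) is an assumption that happens to reproduce the 0.0 and 0.75 results above.

```ruby
# Classic dynamic-programming edit distance: each insertion, deletion,
# or substitution costs 1. Comparison is case-sensitive.
def levenshtein_distance(a, b)
  # previous[j] is the distance between the prefix of a handled so far and b[0, j].
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Assumed normalization: 1 - distance / length of the longer string.
def levenshtein_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein_distance(a, b).to_f / longest
end

levenshtein_similarity('SUBSCRIPTION', 'SUBSCR|P||ON') #=> 0.75
levenshtein_similarity('subscription', 'SUBSCR|P||ON') #=> 0.0
```

The uppercase pair differs in three characters out of twelve, hence 0.75. The lowercase pair shares no characters at all under a case-sensitive comparison, hence 0.0.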
To check whether a given text contains any occurrence of a word, try this:
def occurs?(text, target_word)
  text_words = text.split(' ') # Splits the text into an array of words.
  text_words.each do |word|
    return true if word.jaro_similar(target_word) > 0.8
  end
  false
end
example_text = 'This text has the word SUBSCR|P||ON malformed.'
other_text = 'This text does not.'
occurs?(example_text, 'SUBSCRIPTION') #=> true
occurs?(other_text, 'SUBSCRIPTION') #=> false
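If you would rather use a case-sensitive measure such as Levenshtein, one option is to downcase both sides and pass the measure in as a block. The sketch below is a hypothetical variant of occurs?. The name `fuzzy_occurs?`, the `threshold:` keyword, and the stand-in `same_position_ratio` measure are all made up for illustration so that it runs without any gem. With Amatch you could pass something like `{ |a, b| a.levenshtein_similar(b) }` instead.

```ruby
# Hypothetical variant of occurs?: the block receives two downcased words
# and must return a similarity between 0.0 and 1.0.
def fuzzy_occurs?(text, target_word, threshold: 0.8)
  target = target_word.downcase
  text.split.any? { |word| yield(word.downcase, target) > threshold }
end

# Stand-in measure (not a real distance algorithm): the fraction of
# positions at which both words have the same character.
same_position_ratio = lambda do |a, b|
  longest = [a.length, b.length].max
  return 0.0 if longest.zero?
  a.chars.zip(b.chars).count { |x, y| x == y }.to_f / longest
end

example_text = 'This text has the word SUBSCR|P||ON malformed.'
other_text = 'This text does not.'
fuzzy_occurs?(example_text, 'SUBSCRIPTION', threshold: 0.7, &same_position_ratio) #=> true
fuzzy_occurs?(other_text, 'SUBSCRIPTION', threshold: 0.7, &same_position_ratio) #=> false
```

The right threshold depends on the measure you plug in, so treat 0.7 and 0.8 as starting points to tune against your own data.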
Note that you can also call #downcase on the text words, if you prefer. You will also have to parse the text content of your original file first. Hope this helps!