I have OCR-scanned a large number of documents and need to find a keyword in the scanned files. Because the OCR is unreliable (for example, the word "SUBSCRIPTION" may come out as "SUBSCR|P||ON"), I need to search for a near match rather than an exact match.

Does anyone know how I can search a file for the word "SUBSCRIPTION" and return true if an 80% match is found?

Sean
    Perhaps the [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) might be useful in this context. There are some implementations of this algorithm on [Rubygems](https://rubygems.org/search?query=Levenshtein). – spickermann Jul 26 '17 at 15:11

1 Answer

Take a look at the Amatch gem, found here. It implements several string distance algorithms. Also, read this other answer about the difference between the Levenshtein and Jaro distance algorithms to decide which one suits you better.

TL;DR, here is a small snippet to help you get started in solving your problem, using the Amatch gem.

require 'amatch'
String.include(Amatch::StringMethods) # adds the *_similar methods to String

'subscription'.levenshtein_similar('SUBSCR|P||ON') #=> 0.0
'SUBSCRIPTION'.levenshtein_similar('SUBSCR|P||ON') #=> 0.75
'subscription'.jaro_similar('SUBSCR|P||ON')        #=> 0.83
'SUBSCRIPTION'.jaro_similar('SUBSCR|P||ON')        #=> 0.83
'subscription'.jarowinkler_similar('SUBSCR|P||ON') #=> 0.9
'SUBSCRIPTION'.jarowinkler_similar('SUBSCR|P||ON') #=> 0.9
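
To see where the Levenshtein scores come from, here is a minimal pure-Ruby sketch that needs no gem. It assumes the similarity is normalized as 1 - distance / (length of the longer string), which reproduces the 0.75 and 0.0 above; I have not checked that this is exactly how Amatch computes it. The method names `levenshtein_distance` and `levenshtein_similarity` are my own, not part of Amatch.

```ruby
# Levenshtein distance: minimum number of single-character insertions,
# deletions, and substitutions needed to turn `a` into `b`.
# Only one previous row of the dynamic-programming table is kept.
def levenshtein_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,       # deletion
        current[j - 1] + 1,    # insertion
        previous[j - 1] + cost # substitution (or match)
      ].min
    end
    previous = current
  end
  previous.last
end

# Assumed normalization: 1 - distance / length of the longer string.
def levenshtein_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein_distance(a, b).to_f / longest
end

levenshtein_similarity('SUBSCRIPTION', 'SUBSCR|P||ON') #=> 0.75 (3 substitutions in 12 characters)
levenshtein_similarity('subscription', 'SUBSCR|P||ON') #=> 0.0  (case-sensitive, so every character differs)
```

This also shows why case matters for Levenshtein here: the comparison is case-sensitive, so an 80% threshold would reject "SUBSCR|P||ON" either way (0.75 at best), while the Jaro scores above clear it.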

If you want to evaluate if a given text has any occurrences of a word, try this:

def occurs?(text, target_word)
  # Split the text on whitespace and test each word against the target.
  text.split.any? { |word| word.jaro_similar(target_word) > 0.8 }
end

example_text = 'This text has the word SUBSCR|P||ON malformed.'
other_text = 'This text does not.'

occurs?(example_text, 'SUBSCRIPTION') #=> true
occurs?(other_text, 'SUBSCRIPTION')   #=> false

Note that you can also call #downcase on the words of the text, if you prefer. You also have to extract the text content of your original file first. Hope this helps!
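
As a rough sketch of those two points (downcasing and reading the file), something like the following could work. The helper name `file_mentions?`, its `threshold:` keyword, and the punctuation-stripping regex are all assumptions of mine for illustration. The similarity function is passed in as a block, so you can plug in whichever measure you choose.

```ruby
# Hypothetical helper: true if any whitespace-separated token in the file at
# `path` scores above `threshold` against `target_word`.
# The block receives (token, target), both downcased, and must return a
# similarity between 0.0 and 1.0.
def file_mentions?(path, target_word, threshold: 0.8)
  target = target_word.downcase
  File.foreach(path).any? do |line|
    line.downcase.split.any? do |word|
      # Drop leading/trailing punctuation (e.g. a trailing period) but keep
      # "|" characters, since OCR may have produced them in place of letters.
      token = word.gsub(/\A[^[:alnum:]|]+|[^[:alnum:]|]+\z/, '')
      !token.empty? && yield(token, target) > threshold
    end
  end
end
```

With Amatch set up, a call might look like `file_mentions?('scan.txt', 'SUBSCRIPTION') { |word, target| word.jaro_similar(target) }`, where `scan.txt` is a placeholder for a file containing your OCR output. Reading line by line keeps memory use low for large files.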

rodsoars