2

I'm creating a program that reads a scanned hand written document and coverts it to text. The recognized words must come from a dictionary of about 300 words that I create. As an example, if the hand written word is recognized as "heilo", but my dictionary only contains "hello" and "world", it should convert it to "hello". However, if it recognized it as "planet", it shouldn't match it to anything. I think a possible approach would be to create a score of how closely the recognized word matches each word in the dictionary. If it doesn't get a minimum score, then no match is found.

I'm writing the application in C#. Are there any libraries/examples available that be do something like this, or would I have to code everything from scratch?

Thanks

Kritz
  • 7,099
  • 12
  • 43
  • 73

4 Answers4

5

There is nothing in the standard libraries to compute the distance between words, but there are plenty of examples you can find on the internet: look up "edit distance" or "Levenshtein distance". The idea is to measure the similarity in terms of the number of changes to the first string in order to make it a second string. The distance between "heil" and "hello" is 2, because you need to replace "i" with "l" (first edit), and then append an "o" (the second edit).

When looking for an implementation or implementing your own, avoid the trivial implementation with a 2D array, because it's not memory-efficient. Use the modification with O(min(m,n)) memory requirements instead of the "naive" O(m*n).

Sergey Kalinichenko
  • 714,442
  • 84
  • 1,110
  • 1,523
1

I have no lib at hand to do what you need but searching the web knowing that you want to calculate the Levenshtein Distance might help you in your search.

Christoph Grimmer
  • 4,210
  • 4
  • 40
  • 64
  • 1
    -1: Answers solely based on a reference are not considered a good answer. Your answer must still be valid, even if the link is broken. – Martin Mulder May 25 '13 at 12:08
  • I think the answer _is_ still valid even if the link is broken. Maybe I should have made a comment of it... But instead of simply telling the OP to search for information regarding the Levenshtein Distance I put in a reference to the relevant Wikipedia article as added value and not as the main content of the answer. – Christoph Grimmer May 25 '13 at 20:17
1

Perhaps you should start with a spell checker - there are a number of libraries available that do this.

Community
  • 1
  • 1
Richard
  • 29,854
  • 11
  • 77
  • 120
  • -1: Answers solely based on a reference are not considered a good answer. Your answer must still be valid, even if the link is broken. – Martin Mulder May 25 '13 at 12:09
  • 1
    Neither this, nor the other answer you pasted this comment on, are based solely on reference. They give a name of something to try, with an example link to start from. Both answers remain valid with no link. – Richard May 25 '13 at 12:15
0

There are a few c# snippets online that will get the ball rolling:

Levenshtein: http://www.dotnetperls.com/levenshtein

Boyer-Moore: http://www-igm.univ-mlv.fr/~lecroq/string/node15.html#SECTION00150

Based on those, you can easily implement your own Word Matcher module.

  • 1
    -1: Answers solely based on a reference are not considered a good answer. Your answer must still be valid, even if the link is broken. – Martin Mulder May 25 '13 at 12:38