Approximate text matching

Asked Jun 15 '17 at 07:34

Active Jun 15 '17 at 08:00

Viewed 108 times

I need to compare two pieces of text, say 200 words long. As these were obtained by OCR, discrepancies can arise at two levels:

words can be misspelled,
whole words can be missing or merged, or extra parasitic chunks inserted (in extreme cases, groups of words could be swapped).

The output of the comparison should be a similarity score. I don't think that matching the whole text as one long string would be efficient enough.

Are you aware of methods that specifically address this problem (a two-level Levenshtein distance, perhaps)? Are there libraries available?

(I am not looking for an OCR package.)

edited Jun 15 '17 at 08:00

Cross-posted on Mathematics – Jun 15 '17 at 08:14
Possible duplicate: https://stackoverflow.com/questions/1721738/using-diff-or-anything-else-to-get-character-level-diff-between-text-files – shinobi Jun 15 '17 at 08:38
Maybe you want to implement a two-level Levenshtein distance: (1) at the word level, then (2) at the character level for the words that were matched. – arunk2 Jun 15 '17 at 12:15
@ArunKumar: do you have a reference? – Jun 15 '17 at 13:24
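
For illustration, the two-level idea suggested in the comment above could be sketched roughly as follows. This is only one possible reading of it, not a reference implementation: the function names (char_distance, word_cost, two_level_similarity) are made up for this sketch, whitespace tokenization is assumed, and the normalization by the longer word or text length is an arbitrary choice. The outer edit distance runs over words, and the cost of substituting one word for another is the character-level Levenshtein distance between them, scaled to [0, 1].

```python
def char_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance between two strings (character level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]


def word_cost(u: str, v: str) -> float:
    """Word substitution cost in [0, 1] (assumed normalization)."""
    if u == v:
        return 0.0
    return char_distance(u, v) / max(len(u), len(v))


def two_level_similarity(text_a: str, text_b: str) -> float:
    """Word-level edit distance with character-level substitution costs,
    turned into a similarity score in [0, 1]."""
    a, b = text_a.split(), text_b.split()   # assumption: whitespace tokens
    if not a and not b:
        return 1.0
    prev = [float(j) for j in range(len(b) + 1)]
    for i, u in enumerate(a, 1):
        cur = [float(i)]
        for j, v in enumerate(b, 1):
            cur.append(min(prev[j] + 1.0,                # word u missing in b
                           cur[j - 1] + 1.0,             # extra word v in b
                           prev[j - 1] + word_cost(u, v)))  # misspelled word
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


if __name__ == "__main__":
    print(two_level_similarity("the quick brown fox", "the quikc brown fox"))
    print(two_level_similarity("the quick brown fox", "the brown fox"))
```

With this sketch a misspelled word costs a fraction of a word, while a missing or inserted word costs a full one. It does not handle swapped groups of words (they are charged as ordinary edits), and merged words are only approximated as a substitution plus a deletion. For texts of about 200 words the quadratic word-level table has roughly 40,000 cells, which should be small enough in practice.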

0 Answers