
I need to compare two pieces of text, each about 200 words long. Because they were obtained by OCR, discrepancies can arise at two levels:

  • words can be misspelled,

  • whole words can be missing or merged, or spurious extra chunks can be inserted (in extreme cases, groups of words could be swapped).

The output of the comparison should be a similarity score. I doubt that matching the whole text as one long string would be efficient enough.

Are you aware of methods that specifically address this problem (a two-level Levenshtein distance, perhaps)? Are there libraries available?
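To make the "two-level" idea concrete, here is a rough sketch of what I have in mind, not an established algorithm as far as I know. It runs a Levenshtein-style alignment over words, where the cost of substituting one word for another is the character-level Levenshtein distance normalized by the longer word's length. The function names (`char_distance`, `word_cost`, `two_level_similarity`), whitespace tokenization, and the unit cost for a missing or extra word are all my own assumptions.

```python
def char_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between two strings, character by character."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def word_cost(w1: str, w2: str) -> float:
    """Substitution cost in [0, 1]: 0 for identical words, 1 for unrelated ones."""
    longest = max(len(w1), len(w2))
    if longest == 0:
        return 0.0
    return char_distance(w1, w2) / longest


def two_level_similarity(text1: str, text2: str) -> float:
    """Word-level edit distance with fuzzy substitution, mapped to a [0, 1] score."""
    words1, words2 = text1.split(), text2.split()  # assumption: split on whitespace
    if not words1 and not words2:
        return 1.0
    prev = [float(j) for j in range(len(words2) + 1)]
    for i, w1 in enumerate(words1, start=1):
        curr = [float(i)]
        for j, w2 in enumerate(words2, start=1):
            curr.append(min(
                prev[j] + 1.0,                   # word missing from text2
                curr[j - 1] + 1.0,               # extra word in text2
                prev[j - 1] + word_cost(w1, w2), # misspelled or matching word
            ))
        prev = curr
    # The distance never exceeds the longer word count, so the score stays in [0, 1].
    return 1.0 - prev[-1] / max(len(words1), len(words2))


if __name__ == "__main__":
    print(two_level_similarity("the quick brown fox", "the quikc brown fox"))
    print(two_level_similarity("the quick brown fox", "the brown fox"))
```

This handles misspellings and missing or extra words, but it treats merged words only approximately and does nothing special for swapped groups of words, which is part of why I am asking whether a better-founded method exists.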

(I am not looking for an OCR package.)
