I need to compare two pieces of text, each about 200 words long. Because both were obtained by OCR, discrepancies can arise at two levels:
words can be misspelled,
whole words can be missing or merged, or extra parasitic chunks inserted (in extreme cases, groups of words could be swapped).
The output of the comparison should be a similarity score. I doubt that matching each whole text as one long string would be efficient enough.
Are you aware of methods that specifically address this problem (a two-level Levenshtein distance, perhaps)? Are there libraries available?
(I am not looking for an OCR package.)
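
To make "two-level Levenshtein" concrete, here is a rough sketch of what I have in mind, not an established algorithm that I know of. The texts are aligned word by word with the usual edit-distance recurrence, and the cost of substituting one word for another is the normalized character-level Levenshtein distance between them. The function names (`char_distance`, `word_cost`, `two_level_similarity`), the unit cost for a missing or extra word, and the normalization by the longer word count are all my own assumptions.

```python
def char_distance(a, b):
    """Plain Levenshtein distance between two strings (character level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def word_cost(w1, w2):
    """Cost in [0, 1] of replacing w1 by w2 (assumed normalization)."""
    if not w1 and not w2:
        return 0.0
    return char_distance(w1, w2) / max(len(w1), len(w2))


def two_level_similarity(text1, text2):
    """Word-level edit distance with character-level substitution costs,
    turned into a similarity score in [0, 1]."""
    a, b = text1.split(), text2.split()
    if not a and not b:
        return 1.0
    prev = [float(j) for j in range(len(b) + 1)]
    for i, wa in enumerate(a, 1):
        cur = [float(i)]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1.0,       # word missing from text2
                           cur[j - 1] + 1.0,    # extra word in text2
                           prev[j - 1] + word_cost(wa, wb)))  # misspelling
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


if __name__ == "__main__":
    print(two_level_similarity("the quick brown fox", "the qu1ck brown fox"))
    print(two_level_similarity("the quick brown fox", "the brown fox"))
```

This covers misspellings and missing or extra words, but it does nothing special for merged words or swapped groups of words, which simply show up as several insertions and deletions.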