C# Looking for similar needle in haystack (for OCR)

Question

I've been working on an OCR program that accepts a photo with text in it (in this specific case, a driver's license) as well as a first name and a last name as arguments.

Once the software has read the ID photo, I search the recognized text for the first and last name. Unfortunately, because the image quality can be pretty low, the OCR will sometimes not get the name quite right.

Is there a way I could look for a SIMILAR needle in a haystack? That is, look for any occurrences that are similar to the first/last name? For example:

Needle: campbell

Haystack: 
operaioxsllcence 
gcltdriver 
exries13NOV2020
carnpbeiljtttj
...

The string that would be close enough is "carnpbeil".

This is what I'm using now, and it only helps in very specific situations:

private bool SourceContains(string haystack, string needle)
{
    // True if the needle appears verbatim, or appears after applying one
    // common OCR character confusion (e.g. "rn" read for "m") to the haystack.
    return haystack.Contains(needle) ||
           haystack.Replace("l", "i").Contains(needle) ||
           haystack.Replace("i", "l").Contains(needle) ||
           haystack.Replace("0", "o").Contains(needle) ||
           haystack.Replace("o", "0").Contains(needle) ||
           haystack.Replace("j", "d").Contains(needle) ||
           haystack.Replace("d", "j").Contains(needle) ||
           haystack.Replace("i", "j").Contains(needle) ||
           haystack.Replace("j", "i").Contains(needle) ||
           haystack.Replace("e", "f").Contains(needle) ||
           haystack.Replace("f", "e").Contains(needle) ||
           haystack.Replace("r", "p").Contains(needle) ||
           haystack.Replace("p", "r").Contains(needle) ||
           haystack.Replace("s", "r").Contains(needle) ||
           haystack.Replace("r", "s").Contains(needle) ||
           haystack.Replace("r", "n").Contains(needle) ||
           haystack.Replace("n", "r").Contains(needle) ||
           haystack.Replace("k", "n").Contains(needle) ||
           haystack.Replace("n", "k").Contains(needle) ||
           haystack.Replace("h", "n").Contains(needle) ||
           haystack.Replace("n", "h").Contains(needle) ||
           haystack.Replace("k", "ll").Contains(needle) ||
           haystack.Replace("ll", "k").Contains(needle) ||
           haystack.Replace("ci", "d").Contains(needle) ||
           haystack.Replace("d", "ci").Contains(needle) ||
           haystack.Replace("cl", "d").Contains(needle) ||
           haystack.Replace("d", "cl").Contains(needle) ||
           haystack.Replace("m", "in").Contains(needle) ||
           haystack.Replace("in", "m").Contains(needle) ||
           haystack.Replace("rn", "m").Contains(needle) ||
           haystack.Replace("m", "rn").Contains(needle);
}

score 0 · Answer 1 · answered Sep 09 '17 at 20:37


For each word in the haystack, calculate the Levenshtein distance to the needle. The word with the smallest distance is the most likely match for your needle. Have a look at this question for implementations.
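A minimal sketch of the per-word approach, assuming the haystack is split on whitespace and compared case-insensitively. The class and method names (`FuzzyWordMatch`, `Levenshtein`, `ClosestWord`) are hypothetical and chosen only for illustration:

```csharp
using System;
using System.Linq;

public static class FuzzyWordMatch
{
    // Two-row Levenshtein distance: the minimum number of single-character
    // insertions, deletions, and substitutions needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Returns the whitespace-separated word closest to needle, together with
    // its distance. Assumes the haystack contains at least one word;
    // First() throws on an empty sequence.
    public static (string Word, int Distance) ClosestWord(string haystack, string needle)
    {
        string target = needle.ToLowerInvariant();
        return haystack
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => (Word: w, Distance: Levenshtein(w.ToLowerInvariant(), target)))
            .OrderBy(x => x.Distance)
            .First();
    }
}
```

For example, `FuzzyWordMatch.Levenshtein("campbell", "carnpbeil")` is 3 (one substitution and one insertion for "m" read as "rn", plus one substitution for "l" read as "i"). How small a distance counts as "close enough" is a judgment call that probably depends on the needle's length.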

answered Sep 09 '17 at 20:37

Ulf Kristiansen


Unfortunately, the words detected are sometimes not split up properly. Like in the example I showed, "carnpbeil" would be the match for Campbell (the letters LOOK the same), but the "word" is "carnpbeiljtttj", as it also mashed the first name in with the last. I think that the only way I'm going to improve on this is to increase the quality of the images that are submitted. – Scott Dellinger Sep 14 '17 at 13:51
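One possible way around the word-splitting problem, offered only as a sketch, is to compute the edit distance between the needle and the best-matching substring of the haystack, so the match may start and end anywhere inside a merged token. This is a variation on the Levenshtein recurrence (often called approximate substring matching), not something proposed in the thread itself. The names `FuzzySubstring`, `BestSubstringDistance`, and `FuzzyContains`, and the `maxDistance` threshold, are hypothetical:

```csharp
using System;

public static class FuzzySubstring
{
    // Smallest edit distance between needle and ANY substring of text.
    // Same recurrence as Levenshtein, except that the cost of matching an
    // empty needle prefix is always 0 (a match may start anywhere) and the
    // result is the minimum over all end positions (it may end anywhere).
    public static int BestSubstringDistance(string text, string needle)
    {
        // prev[i] / curr[i]: cost of matching the first i needle characters
        // against some substring ending at the previous / current text position.
        var prev = new int[needle.Length + 1];
        var curr = new int[needle.Length + 1];
        for (int i = 0; i <= needle.Length; i++) prev[i] = i;
        int best = prev[needle.Length];

        foreach (char c in text)
        {
            curr[0] = 0;
            for (int i = 1; i <= needle.Length; i++)
            {
                int cost = needle[i - 1] == c ? 0 : 1;
                curr[i] = Math.Min(Math.Min(curr[i - 1] + 1, prev[i] + 1),
                                   prev[i - 1] + cost);
            }
            best = Math.Min(best, curr[needle.Length]);
            var tmp = prev; prev = curr; curr = tmp;
        }
        return best;
    }

    // Case-insensitive check: does the haystack contain something within
    // maxDistance edits of the needle? The threshold is an assumption to tune.
    public static bool FuzzyContains(string haystack, string needle, int maxDistance)
    {
        return BestSubstringDistance(haystack.ToLowerInvariant(),
                                     needle.ToLowerInvariant()) <= maxDistance;
    }
}
```

With this, `FuzzySubstring.BestSubstringDistance("carnpbeiljtttj", "campbell")` is 3, because the trailing "jtttj" is simply not part of the matched substring. A fixed threshold will admit false positives for short names, so scaling `maxDistance` with the needle length may work better.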
