C# comparing strings with a bit of leniency

Question

I am currently working on a database project where we have some given names that need to be compared to historical names in an existing database. These names belong to indigenous people in Malaysia and are being gathered by anthropologists doing interviews in these remote areas months, if not years, apart. Many of these indigenous people are illiterate and don't have consistent pronunciations, much less spellings, of their names. This leads most anthropologists to make educated guesses at how the names may be spelled. Most of these guesses are pretty consistent, only being a few letters off, and look very similar to the human eye.

However, the human eye is obviously quite different from a computer's eye, and I'm unsure how best to go about comparing names and seeing how "close" they are to each other. I'm certainly not looking for anything even remotely precise, but I was wondering if anybody knew of an existing method of doing this. Searching for solutions to this sort of problem is a nightmare given how close it is to the incredibly common problem of comparing strings.

Also, as a side note, this is the closest question I've seen to mine: Lenient string comparison. However, the big issue there is that it uses a dictionary search. These names are in an entirely different language that doesn't even really have a written dictionary, and the vast majority are being sounded out by third parties because the actual individuals don't speak English. Because of this, I need something that can do this sort of "percent match" for two given strings rather than against a dictionary of strings.

The other solution I've seen is computing the Levenshtein distance, but I wasn't sure whether there might be a solution that also takes into account the "sound" of the word. For example, "Kook" and "Chuch" are far apart in terms of Levenshtein distance, but they are actually rather similar depending on how they're said.
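To make the "percent match" idea concrete, here is a minimal sketch of what I mean by a Levenshtein-based score for two strings. The class and method names (`NameSimilarity`, `Levenshtein`, `PercentMatch`) are made up for illustration, and dividing by the longer string's length is just one possible normalization that I am assuming, not an established standard.

```csharp
using System;

public static class NameSimilarity
{
    // Classic dynamic-programming Levenshtein distance:
    // insertions, deletions and substitutions each cost 1.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                    prev[j - 1] + cost);                    // substitute or match
            }
            (prev, curr) = (curr, prev); // reuse the two row buffers
        }
        return prev[b.Length];
    }

    // Assumed normalization: 1.0 for identical strings (ignoring case),
    // 0.0 when every position has to be edited.
    public static double PercentMatch(string a, string b)
    {
        a = a.ToLowerInvariant();
        b = b.ToLowerInvariant();
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 1.0;
        return 1.0 - (double)Levenshtein(a, b) / longest;
    }
}
```

With this sketch, `NameSimilarity.PercentMatch("Kook", "Chuch")` comes out as 0, because the two spellings share no letters at all, which is exactly the kind of result that ignores how the names sound.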

You probably want to phoneticize the words first, then weight the phonemes by distance. There must be something out there already; see these interesting articles: https://www.microsoft.com/en-us/research/blog/a-phonetic-matching-made-in%CB%88h%C9%9Bv%C9%99n/ and https://stackabuse.com/phonetic-similarity-of-words-a-vectorized-approach-in-python/ – Charlieface, Mar 30 '21 at 02:15
Maybe something using Soundex? Check out https://stackoverflow.com/q/11121936/5803406 – devNull, Mar 30 '21 at 02:38
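As a rough illustration of the Soundex suggestion, the sketch below implements the standard American Soundex rules in C#. The class name `PhoneticKey` and the `SoundsAlike` helper are hypothetical, and `SoundsAlike` is a non-standard tweak: it replaces the leading letter with its digit class so that names such as "Kook" and "Chuch", which differ only in a same-sounding first consonant, can still match. Soundex was designed around English surnames, so how well it suits these names is an open question.

```csharp
using System;
using System.Text;

public static class PhoneticKey
{
    // Soundex digit for a letter; '0' marks letters that are not coded
    // (vowels plus H, W and Y).
    private static char Code(char c)
    {
        switch (c)
        {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0';
        }
    }

    // American Soundex: first letter followed by three digits, zero-padded.
    public static string Soundex(string name)
    {
        var sb = new StringBuilder();
        char lastCode = '0';
        foreach (char c in name.ToUpperInvariant())
        {
            if (c < 'A' || c > 'Z') continue;      // ignore non-letters
            char code = Code(c);
            if (sb.Length == 0)
            {
                sb.Append(c);                      // keep the first letter as-is
                lastCode = code;
                continue;
            }
            if (c == 'H' || c == 'W') continue;    // H and W do not separate duplicates
            if (code == '0') { lastCode = '0'; continue; } // vowels do separate them
            if (code != lastCode) sb.Append(code); // skip adjacent repeats of a code
            lastCode = code;
            if (sb.Length == 4) break;
        }
        return sb.ToString().PadRight(4, '0');
    }

    // Non-standard variant (an assumption for this sketch): compare keys after
    // replacing the leading letter with its digit class, so K and C can match.
    public static bool SoundsAlike(string a, string b)
    {
        string ka = Soundex(a), kb = Soundex(b);
        return Code(ka[0]) == Code(kb[0]) && ka.Substring(1) == kb.Substring(1);
    }
}
```

Here `PhoneticKey.Soundex("Kook")` gives `K200` and `PhoneticKey.Soundex("Chuch")` gives `C200`, so plain Soundex treats them as different keys, while the relaxed `SoundsAlike` check returns `true` for the pair. A key like this could also be fed into an edit-distance comparison instead of the raw spelling.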
