2

I have the string "The White Horse is hungry".

Now I need to match it against the possible pronunciations. Some examples follow (think of these as phoneme sequences, i.e. the ways a user might pronounce it):

The White Horse is hungary
The White Horse is not hungry
The White Horse is very hungry
The Horse is hungry
The Horse is hungries
White Horse is hungry
star wars..clone wars

As you can see, the pronunciations can be very similar or very different. I can apply Levenshtein distance to measure the difference, and it gave me very accurate results. However, I think I could get even better results if I could also compare two phonemes for similarity, for example to handle the case where the user says the wrong phoneme rather than adding or deleting phonemes.

Does anyone know a good algorithm for this, and an example of or link to a C# implementation?
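To make the idea concrete, here is a rough sketch rather than a tested solution: a Levenshtein variant in which a substitution costs less when the two symbols are assumed to sound alike. The class name `PhonemeDistance`, the `SubstitutionCost` helper, and the table of "similar" pairs are all hypothetical and only illustrate the shape; a real table would need actual phoneme similarity data.

```csharp
using System;

public static class PhonemeDistance
{
    // Hypothetical cost table: 0 for identical symbols, 0.5 for pairs that are
    // assumed to sound alike, 1 otherwise. The listed pairs are illustrative only.
    public static double SubstitutionCost(char a, char b)
    {
        if (a == b) return 0.0;
        string pair = new string(new[] { a, b });
        string[] similar = { "pb", "bp", "td", "dt", "kg", "gk", "sz", "zs" };
        return Array.IndexOf(similar, pair) >= 0 ? 0.5 : 1.0;
    }

    // Levenshtein distance where insertions and deletions cost 1 and
    // substitutions are priced by SubstitutionCost.
    public static double Weighted(string s, string t)
    {
        var d = new double[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= t.Length; j++) d[0, j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            for (int j = 1; j <= t.Length; j++)
            {
                double delete = d[i - 1, j] + 1;
                double insert = d[i, j - 1] + 1;
                double substitute = d[i - 1, j - 1] + SubstitutionCost(s[i - 1], t[j - 1]);
                d[i, j] = Math.Min(Math.Min(delete, insert), substitute);
            }
        }
        return d[s.Length, t.Length];
    }
}
```

With this table, `PhonemeDistance.Weighted("bat", "pat")` is cheaper than `PhonemeDistance.Weighted("bat", "cat")`, which is the behaviour plain Levenshtein cannot give.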

System.Windows.Form

2 Answers

1

You may want to try the algorithm here: http://www.catalysoft.com/articles/StrikeAMatch.html

A sample implementation of it.

using System.Collections.Generic;
using System.Globalization;
using System.Linq;

string input = "The White Horse is hungry";
string[] toTest = new string[]{
    "The White Horse is hungary",
    "The White Horse is not hungry",
    "The White Horse is very hungry",
    "The Horse is hungry",
    "The Horse is hungries",
    "White Horse is hungry",
    "star wars..clone wars",
};

string closest = toTest
                .Select(s => new
                {
                    Str = s,
                    Distance = s.Distance(input)
                })
                .OrderByDescending(x => x.Distance)
                .First().Str;

public static class StringSimilarity
{
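    // Despite the name, this returns a similarity in [0, 1]: higher means closer.
    public static float Distance(this string s1, string s2)
    {
        // Intersect() is set-based, so compare distinct pairs on both sides;
        // otherwise identical strings with repeated pairs score below 1.
        var p1 = GetPairs(s1).Distinct().ToList();
        var p2 = GetPairs(s2).Distinct().ToList();
        int total = p1.Count + p2.Count;
        return total == 0 ? 0f : (2f * p1.Intersect(p2).Count()) / total;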
    }

    static List<string> GetPairs(string s)
    {
        if (s == null) return new List<string>();
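        // Lower-case short strings too, so they compare like the pairs below.
        if (s.Length < 3) return new List<string>() { s.ToLower(CultureInfo.InvariantCulture) };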

        List<string> result = new List<string>();
        for (int i = 0; i < s.Length - 1; i++)
        {
            result.Add(s.Substring(i, 2).ToLower(CultureInfo.InvariantCulture));
        }
        return result;
    }
}
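
A variant that stays closer to the linked article, as I read it: pairs are taken per word, so pairs spanning a space are ignored, and repeated pairs are matched one-for-one. Treat this as a sketch under that reading; the names `WordPairSimilarity`, `WordLetterPairs` and `Compare` are mine, not from the article.

```csharp
using System;
using System.Collections.Generic;

public static class WordPairSimilarity
{
    // Adjacent letter pairs of every whitespace-separated word, upper-cased.
    static List<string> WordLetterPairs(string s)
    {
        var pairs = new List<string>();
        string[] words = s.ToUpperInvariant()
                          .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
            for (int i = 0; i < word.Length - 1; i++)
                pairs.Add(word.Substring(i, 2));
        return pairs;
    }

    // 2 * |shared pairs| / (|pairs1| + |pairs2|), counting duplicates.
    public static double Compare(string s1, string s2)
    {
        List<string> p1 = WordLetterPairs(s1);
        List<string> p2 = WordLetterPairs(s2);
        int total = p1.Count + p2.Count;
        if (total == 0) return 0.0;

        int shared = 0;
        foreach (string pair in p1)
            if (p2.Remove(pair)) shared++;   // each pair in p2 is matched at most once
        return 2.0 * shared / total;
    }
}
```

Called as `WordPairSimilarity.Compare(s, input)`, it returns a score between 0 and 1 that can be ordered the same way as above.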
I4V
  • Err, can you please present this as one code block? – System.Windows.Form May 31 '13 at 07:36
  • What is `Distance = s.Distance(input)`? Where is the Distance class? – System.Windows.Form May 31 '13 at 07:38
  • @System.Windows.Form It is an *extension* method defined in class `StringSimilarity`. You can also use it as `StringSimilarity.Distance(s,input)` – I4V May 31 '13 at 07:41
  • Thanks a lot for this algorithm. Another question: does this same code work with symbols that are International Phonetic Alphabet characters? – System.Windows.Form May 31 '13 at 08:35
  • @System.Windows.Form maybe you can use [this trick](http://stackoverflow.com/questions/13769202/removing-replacing-international-characters) – I4V May 31 '13 at 08:50
  • OK, big issue. This didn't return the percentage value as the website does. Instead, it gave me letters! – System.Windows.Form May 31 '13 at 09:04
  • @System.Windows.Form Remove the `.Str` at the end (`.First().Str;`). Now you have both values, Distance + string. (Of course, change the declaration of `closest` from `string` to `var`.) – I4V May 31 '13 at 09:20
0

If not Levenshtein distance, what about a fuzzy-matching approach or LCS?
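
For the LCS option, a minimal sketch, assuming LCS here means longest common subsequence (not substring). The class name `LcsSimilarity` and the `Ratio` normalisation are my own choices for illustration, not a standard API.

```csharp
using System;

public static class LcsSimilarity
{
    // Length of the longest common subsequence, by the usual dynamic programme.
    public static int Length(string a, string b)
    {
        var dp = new int[a.Length + 1, b.Length + 1];
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                dp[i, j] = a[i - 1] == b[j - 1]
                    ? dp[i - 1, j - 1] + 1
                    : Math.Max(dp[i - 1, j], dp[i, j - 1]);
            }
        }
        return dp[a.Length, b.Length];
    }

    // Normalised to [0, 1]: 1 when the strings are identical.
    public static double Ratio(string a, string b)
    {
        int total = a.Length + b.Length;
        return total == 0 ? 1.0 : 2.0 * Length(a, b) / total;
    }
}
```

`LcsSimilarity.Ratio(candidate, input)` can then be used to rank candidates, highest first. Like plain Levenshtein, it treats every mismatched symbol the same way, so it does not by itself account for similar-sounding phonemes.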

S.N