How to calculate score for Metaphone/Soundex name searching in .net

Question

I have to obtain a score for the name field in the search with soundex or Meta-phone matching. For Eg: if i searched "JOHN DOE" i took all the sounds like matching on this search parameter. It will return a vast records similar to its soundex or Meta-phone matching. So i need to provide a score based on the obtained data so that the most matched data can be taken or shown on top of the list.Like wise user can take 85% or 90% matching data from the list. Please help with technique to create score in c# for soundex or Meta-phone obtained values

score 2 · Answer 1 · edited May 23 '17 at 12:16

I'm assuming you search all your strings and filter out the ones which has ALL soundex codes in the query string. So for example, if query is "John Doe" then you would have have two soundex codes, one for John and other for DOE. So next, you would retrieve all strings that have at least these two soundex codes.

Now if you get too many records then you need to apply techniques from the domain of Information Retrieval to rank your results. There are unfortunately many ways to do it. I'll describe some of my favorite ways in increasing order of complexity:

Use edit distance to rank your strings. You would have function GetEditDistance(s1, s2) and it basically returns number of add/update/deletes you need to do in s1 to get s2. This is fairly simple and you can get code and more info from here: How to calculate distance similarity measure of given 2 strings?.
Use similarity metric such as Jaccard similarity. You basically take two strings and get ratio of count of common characters divided by count of all distinct characters. This is character-level Jaccard score. You can also do it token level. For example, token level Jaccard score between "John Doe" and "John Wolfenstein" is 1/3 but for "John Doe" and "John F. Doe", the score is 2/3. Other similarity metrics are Dice and Cosine which are also very easy to calculate and has dedicted Wikipedia pages.
Finally if you want to do it "properly" as in the IR book I linked above, then you need to first calculate TF/IDF. This essentially assigns a weight to each term that is in your records. If term is occurring too many times (like John) then its weight would be lower. If term is rather rare (like Wolfenstein) then its weight is higher. Once you have weights you basically use similarity metric I described in #2.

Updated for example in comment by OP

In your examplem the query is osama and results are osama,ossama,ussama,oswin,ASAMOAH. It looks to me that Dice coefficient or Cosine similarity would be best in your case. Calculating Dice coefficient is very easy so I'll use that here but you might want to experiment with Cosine similarity also.

To calculate character level Dice coefficient, use following formula:

Dice coefficient = 2 * (count of common characters between query and result) / (sum of all characters in query and result)

For example, Dice coefficient between osama and ossama is 2*5/(5+6)=0.91.

Below are the Dice for all results for query osama:

osama   osama   ->  1.00
osama   ossama  ->  0.91
osama   ussama  ->  0.72
osama   oswin   ->  0.40
osama   ASAMOAH ->  0.83

So the ranked results would be osama, ossama, ASAMOAH, ussama, oswin which looks reasonable to me.

i have two separate search for soundex and metaphone. If user searches using metaphone,i return those sounds like results then for eg: if user search osama, metaphone returns osama,ossama,ussama,oswin,ASAMOAH etc. but here the top pirority score should be given for ossama,ussama,osama. how that scoring techinique can be made. — deepu, Mar 25 '14 at 11:32

score 0 · Answer 2 · answered Mar 24 '14 at 08:05

I'm not entirely sure what you're asking here, but here's how I interpreted it for my answer:

Given two soundex or metaphone values, how can I evaluate which one is closer to a specific value?

(If that's not what you're asking, feel free to disregard the rest of this post.)

A few years ago I decided I wanted to know how spellcheckers worked. So, I looked at aspell and the like and it turned out they do something nearly identical to what you want to do. Say you have a value: TELEPHON. You need to find the closest word in a dictionary that matches that. You run it through some sort of phonetic algorithm, and you get a phonetic representation of TELEPHON, which might be TLPN. (Phonetic algorithms are weird.) Then you go through your list of key-value pairs (value is a word in your dictionary; key is the word run through the same phonetic algorithm) and you find the ones that are closest to TLPN. Those are your candidates.

It's a bit more difficult than that, but this is the type of thing you want to do, correct?

If that's the case, what you're looking for is a distance algorithm. They take two strings and measure how many edits it takes to turn one string into the other, and it provides a simple mechanism for comparing the two strings.

I used the Levenshtein Distance Algorithm. It's pretty simple and well understood. There are a few different modifications you can make to it, but if you're using a phonetic algorithm as your inputs they're probably unnecessary.

So yeah, take a look at that. (And forgive me for the long-winded answer.)

(Additional note: If you think phonetic algorithms are weird, you should see string distance algorithms. Try to run one out by hand some time -- you end up with a matrix that appears to be a random jumble of numbers... but the one in the bottom right-hand corner of your matrix is the right answer, somehow.)

Levenshtein Distance algorithm i have gone through but i having issues with long names. if user search osama, metaphone returns osama,ossama,ussama,oswin,ASAMOAH etc. but here the top pirority score should be given for ossama,ussama,osama where the returned result will be having firstname middlename and lastname which will be more than the searched names. i have to make scoring not regarding the distance but regarding simple matching — deepu, Mar 25 '14 at 11:35
The key to Levenshtein distance is that the longer the word, the more loose of a range you can have with the score. Metaphone should yield a 4-letter score, which means you'd ideally want a distance range of 0 -- strings that match that metaphone code. If you want to further limit it, you can run Levenshtein distance on the results you've filtered based on Metaphone and take the closest match from there. So if you are searching for "Osama", you'd get "Osama" because they're identical, followed by (I think) "Ossama" because the distance between that and the original score is 1. — Ari Roth, Mar 25 '14 at 21:35

score 0 · Answer 3 · answered Mar 08 '16 at 04:57

Here's a very lazy C# conversion of a metaphone 1 algo written in C and found at http://aspell.net/metaphone/metaphone-kuhn.txt. Unit test included and successfuly passed.

public class Metaphone
{
    const string VOWELS = "AEIOU";
    const string FRONTV = "EIY";   /* special cases for letters in FRONT of these */
    const string VARSON = "CSPTG"; /* variable sound--those modified by adding an "h"    */
    const string DOUBLE = ".";     /* let these double letters through */

    const char NULLCHAR = '\0';

    private int strchr(string s, char c)
    {
        if (s.IndexOf(c) < 0)
            return NULLCHAR;
        else
            return 1; //dummy value to indicate found because we don't use non NULL return
    }

    private void strncat(ref string s, string s2, int unusedLen)
    {
        s = s + s2;
    }

    private void strncat(ref string s, char c, int unusedLen)
    {
        s = s + new string(c, 1);
    }

    private int strlen(char[] s)
    {
        var i = 0;
        foreach (var c in s)
        {
            if (c == NULLCHAR)
                return i;
            i++;
        }
        return -1;
    }

    private int strlen(string s)
    {
        return s.Length;
    }

    private void ShiftLeftByOne(char[] chars, int firstDestIndex)
    {
        for (var i = firstDestIndex; i < chars.Length - 1; i++)
            chars[i] = chars[i + 1];
    }

    private bool startsWith(char[] chars, char c1, char c2)
    {
        return chars[0] == c1 && chars[1] == c2;
    }

    public string GetPhonetic(string name, int maxLen)
    {
        string metaph = "";
        int ii, jj, Lng, lastChr;
        bool silent, hard;

        char curLtr, prevLtr, nextLtr, nextLtr2, nextLtr3;

        bool vowelAfter, vowelBefore, frontvAfter;

        char[] wname = new char[60];
        char[] ename = wname;

        jj = 0;
        for (ii = 0; ii < name.Length; ii++)
        {
            if (char.IsLetter(name[ii]))
            {
                ename[jj] = char.ToUpper(name[ii]);
                jj++;
            }
        }
        ename[jj] = NULLCHAR;

        if (strlen(ename) == 0)
            return null;

        /* if ae, gn, kn, pn, wr then drop the first letter */
        //char *chrptr, *chrptr1;
        //if ((chrptr = strchr2(excpPAIR, ename[0])) != NULLCHAR)
        //{
        //    chrptr1 = nextLTR + (chrptr - excpPAIR);
        //    if (*chrptr1 == ename[1])
        if (
            startsWith(ename, 'A', 'E') ||
            startsWith(ename, 'G', 'N') ||
            startsWith(ename, 'K', 'N') ||
            startsWith(ename, 'P', 'N') ||
            startsWith(ename, 'W', 'R')
            )
            ShiftLeftByOne(ename, 0);

        /* change x to s */
        if (ename[0] == 'X')
            ename[0] = 'S';

        /* get rid of the "h" in "wh" */
        //if (strncmp(ename, "WH", 2) == 0)
        if (startsWith(ename, 'W', 'H'))
            ShiftLeftByOne(ename, 1);

        Lng = strlen(ename);
        lastChr = Lng - 1;   /* index to last character in string makes code easier*/

        /* Remove an S from the end of the string */
        if (ename[lastChr] == 'S')
        {
            ename[lastChr] = NULLCHAR;
            Lng = strlen(ename);
            lastChr = Lng - 1;
        }

        for (ii = 0; ((strlen(metaph) < maxLen) && (ii < Lng)); ii++)
        {
            curLtr = ename[ii];

            vowelBefore = false;
            prevLtr = ' ';
            if (ii > 0)
            {
                prevLtr = ename[ii - 1];
                if (strchr(VOWELS, prevLtr) != NULLCHAR)
                    vowelBefore = true;
            }
            /* if first letter is a vowel KEEP it */
            if (ii == 0 && (strchr(VOWELS, curLtr) != NULLCHAR))
            {
                strncat(ref metaph, curLtr, 1);
                continue;
            }

            vowelAfter = false;
            frontvAfter = false;
            nextLtr = ' ';
            if (ii < lastChr)
            {
                nextLtr = ename[ii + 1];
                if (strchr(VOWELS, nextLtr) != NULLCHAR)
                    vowelAfter = true;
                if (strchr(FRONTV, nextLtr) != NULLCHAR)
                    frontvAfter = true;
            }
            /* skip double letters except ones in list */
            if (curLtr == nextLtr && (strchr(DOUBLE, nextLtr) == NULLCHAR))
                continue;

            nextLtr2 = ' ';
            if (ii < (lastChr - 1))
                nextLtr2 = ename[ii + 2];

            nextLtr3 = ' ';
            if (ii < (lastChr - 2))
                nextLtr3 = ename[ii + 3];

            switch (curLtr)
            {
                case 'B':
                    silent = false;
                    if (ii == lastChr && prevLtr == 'M')
                        silent = true;
                    if (!silent)
                        strncat(ref metaph, curLtr, 1);
                    break;

                /*silent -sci-,-sce-,-scy-;  sci-, etc OK*/
                case 'C':
                    if (!(ii > 1 && prevLtr == 'S' && frontvAfter))
                        if (ii > 0 && nextLtr == 'I' && nextLtr2 == 'A')
                            strncat(ref metaph, "X", 1);
                        else
                            if (frontvAfter)
                                strncat(ref metaph, "S", 1);
                            else
                                if (ii > 1 && prevLtr == 'S' && nextLtr == 'H')
                                    strncat(ref metaph, "K", 1);
                                else
                                    if (nextLtr == 'H')
                                        if (ii == 0 && (strchr(VOWELS, nextLtr2) == NULLCHAR))
                                            strncat(ref metaph, "K", 1);
                                        else
                                            strncat(ref metaph, "X", 1);
                                    else
                                        if (prevLtr == 'C')
                                            strncat(ref metaph, "C", 1);
                                        else
                                            strncat(ref metaph, "K", 1);
                    break;

                case 'D':
                    if (nextLtr == 'G' && (strchr(FRONTV, nextLtr2) != NULLCHAR))
                        strncat(ref metaph, "J", 1);
                    else
                        strncat(ref metaph, "T", 1);
                    break;

                case 'G':
                    silent = false;
                    /* SILENT -gh- except for -gh and no vowel after h */
                    if ((ii < (lastChr - 1) && nextLtr == 'H')
                        && (strchr(VOWELS, nextLtr2) == NULLCHAR))
                        silent = true;

                    if ((ii == (lastChr - 3))
                        && nextLtr == 'N' && nextLtr2 == 'E' && nextLtr3 == 'D')
                        silent = true;
                    else
                        if ((ii == (lastChr - 1)) && nextLtr == 'N')
                            silent = true;

                    if (prevLtr == 'D' && frontvAfter)
                        silent = true;

                    if (prevLtr == 'G')
                        hard = true;
                    else
                        hard = false;

                    if (!silent)
                        if (frontvAfter && (!hard))
                            strncat(ref metaph, "J", 1);
                        else
                            strncat(ref metaph, "K", 1);
                    break;

                case 'H':
                    silent = false;
                    if (strchr(VARSON, prevLtr) != NULLCHAR)
                        silent = true;

                    if (vowelBefore && !vowelAfter)
                        silent = true;

                    if (!silent)
                        strncat(ref metaph, curLtr, 1);
                    break;

                case 'F':
                case 'J':
                case 'L':
                case 'M':
                case 'N':
                case 'R':
                    strncat(ref metaph, curLtr, 1);
                    break;

                case 'K':
                    if (prevLtr != 'C')
                        strncat(ref metaph, curLtr, 1);
                    break;

                case 'P':
                    if (nextLtr == 'H')
                        strncat(ref metaph, "F", 1);
                    else
                        strncat(ref metaph, "P", 1);
                    break;

                case 'Q':
                    strncat(ref metaph, "K", 1);
                    break;

                case 'S':
                    if (ii > 1 && nextLtr == 'I'
                        && (nextLtr2 == 'O' || nextLtr2 == 'A'))
                        strncat(ref metaph, "X", 1);
                    else
                        if (nextLtr == 'H')
                            strncat(ref metaph, "X", 1);
                        else
                            strncat(ref metaph, "S", 1);
                    break;

                case 'T':
                    if (ii > 1 && nextLtr == 'I'
                        && (nextLtr2 == 'O' || nextLtr2 == 'A'))
                        strncat(ref metaph, "X", 1);
                    else
                        if (nextLtr == 'H')         /* The=0, Tho=T, Withrow=0 */
                            if (ii > 0 || (strchr(VOWELS, nextLtr2) != NULLCHAR))
                                strncat(ref metaph, "0", 1);
                            else
                                strncat(ref metaph, "T", 1);
                        else
                            if (!(ii < (lastChr - 2) && nextLtr == 'C' && nextLtr2 == 'H'))
                                strncat(ref metaph, "T", 1);
                    break;

                case 'V':
                    strncat(ref metaph, "F", 1);
                    break;

                case 'W':
                case 'Y':
                    if (ii < lastChr && vowelAfter)
                        strncat(ref metaph, curLtr, 1);
                    break;

                case 'X':
                    strncat(ref metaph, "KS", 2);
                    break;

                case 'Z':
                    strncat(ref metaph, "S", 1);
                    break;
            }
        }

        /*  
         DON'T DO THIS NOW, REMOVING "S" IN BEGINNING HAS the same effect
         with plurals, in addition imbedded S's in the Metaphone are included:

            Lng = strlen(metaph);
            lastChr = Lng -1;
            if ( metaph[lastChr] == 'S' && Lng >= 3 ) 
                metaph[lastChr] = '\0';
        */

        return metaph;
    }

    /*
    [TestClass]
    public class UnitTest1
    {
        [TestMethod]
        public void Metaphone1()
        {
            var mp = new Metaphone();

            var pairs = new Dictionary<string, string>();
            pairs.Add("ANASTHA", "ANS0");
            pairs.Add("DAVIS-CARTER", "TFSKRTR");
            pairs.Add("ESCARMANT", "ESKRMNT ");
            pairs.Add("MCCALL", "MCL");
            pairs.Add("MCCROREY", "MCRR");
            pairs.Add("MERSEAL", "MRSL");
            pairs.Add("PIEURISSAINT", "PRSNT");
            pairs.Add("ROTMAN", "RTMN");
            pairs.Add("SCHEVEL", "SXFL");
            pairs.Add("SCHROM", "SXRM");
            pairs.Add("SEAL", "SL");
            pairs.Add("SPARR", "SPR");
            pairs.Add("STARLEPER", "STRLPR");
            pairs.Add("THRASH", "TRX");

            foreach (var pair in pairs)
            {
                var output = mp.GetPhonetic(pair.Key, 20);
                System.Diagnostics.Debug.WriteLine("{0} = {1} > {2}", pair.Key, pair.Value, output);
            }

        }
    }
    */
}

How to calculate score for Metaphone/Soundex name searching in .net

3 Answers3

Linked