How to measure similarity of 2 strings aside from computing distance

Question

I am creating a program that checks if the word is simplified word(txt, msg, etc.) and if it is simplified it finds the correct spelling like txt=text, msg=message. Iam using the NHunspell Suggest Method in c# which suggest all possible results.

The problem is if I inputted "txt" the result is text,tat, tot, etc. I dont know how to select the correct word. I used Levenshtein Distance (C# - Compare String Similarity) but the results still results to 1.

Input: txt Result: text = 1, ext = 1 tit = 1

Can you help me how to get the meaning or the correct spelling of the simplified words? Example: msg

Seems like one way to improve it, would be to check if each of the results contain at least all characters of the input. — Corak, Jul 19 '13 at 14:46
I agree with Corak - for the example you gave, the correct option of the 3 results, "text", "ext" and "tit", was the only one to contain all of the letters, in the correct order, present in the input — Sam, Jul 19 '13 at 14:54
You seem to be trying to write a spell checker that also makes suggestions. This is not a trivial task. Do you have a problem with the programming, or coming up with an algorithm? — CodeCaster, Jul 19 '13 at 15:01
both I am finding an algorithm to find the correct word and how to program the algorithm. — MMakati, Jul 19 '13 at 15:04

Tim Schmelter · Answer 1 · 2013-07-19T15:07:18.147

I have tested your input with your sample data and only text has a distance of 25 whereas the other have a distance of 33. Here's my code:

string input = "TXT";
string[] words = new[]{"text","tat","tot"};
var levenshtein = new Levenshtein();  
const int maxDistance = 30;

var distanceGroups = words
        .Select(w => new
        {
            Word = w,
            Distance = levenshtein.iLD(w.ToUpperInvariant(), input)
        })
        .Where(x => x.Distance <= maxDistance)
        .GroupBy(x => x.Distance)
        .OrderBy(g => g.Key)
        .ToList();
foreach (var topCandidate in distanceGroups.First())
    Console.WriteLine("Word:{0} Distance:{1}", topCandidate.Word, topCandidate.Distance);

and here is the levenshtein class:

public class Levenshtein
{
    ///*****************************
    /// Compute Levenshtein distance 
    /// Memory efficient version
    ///*****************************
    public int iLD(String sRow, String sCol)
    {
        int RowLen = sRow.Length;  // length of sRow
        int ColLen = sCol.Length;  // length of sCol
        int RowIdx;                // iterates through sRow
        int ColIdx;                // iterates through sCol
        char Row_i;                // ith character of sRow
        char Col_j;                // jth character of sCol
        int cost;                   // cost

        /// Test string length
        if (Math.Max(sRow.Length, sCol.Length) > Math.Pow(2, 31))
            throw (new Exception("\nMaximum string length in Levenshtein.iLD is " + Math.Pow(2, 31) + ".\nYours is " + Math.Max(sRow.Length, sCol.Length) + "."));

        // Step 1

        if (RowLen == 0)
        {
            return ColLen;
        }

        if (ColLen == 0)
        {
            return RowLen;
        }

        /// Create the two vectors
        int[] v0 = new int[RowLen + 1];
        int[] v1 = new int[RowLen + 1];
        int[] vTmp;



        /// Step 2
        /// Initialize the first vector
        for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
        {
            v0[RowIdx] = RowIdx;
        }

        // Step 3

        /// Fore each column
        for (ColIdx = 1; ColIdx <= ColLen; ColIdx++)
        {
            /// Set the 0'th element to the column number
            v1[0] = ColIdx;

            Col_j = sCol[ColIdx - 1];


            // Step 4

            /// Fore each row
            for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
            {
                Row_i = sRow[RowIdx - 1];


                // Step 5

                if (Row_i == Col_j)
                {
                    cost = 0;
                }
                else
                {
                    cost = 1;
                }

                // Step 6

                /// Find minimum
                int m_min = v0[RowIdx] + 1;
                int b = v1[RowIdx - 1] + 1;
                int c = v0[RowIdx - 1] + cost;

                if (b < m_min)
                {
                    m_min = b;
                }
                if (c < m_min)
                {
                    m_min = c;
                }

                v1[RowIdx] = m_min;
            }

            /// Swap the vectors
            vTmp = v0;
            v0 = v1;
            v1 = vTmp;

        }

        // Step 7

        /// Value between 0 - 100
        /// 0==perfect match 100==totaly different
        /// 
        /// The vectors where swaped one last time at the end of the last loop,
        /// that is why the result is now in v0 rather than in v1
        //System.Console.WriteLine("iDist=" + v0[RowLen]);
        int max = System.Math.Max(RowLen, ColLen);
        return ((100 * v0[RowLen]) / max);
    }


    ///*****************************
    /// Compute the min
    ///*****************************

    private int Minimum(int a, int b, int c)
    {
        int mi = a;

        if (b < mi)
        {
            mi = b;
        }
        if (c < mi)
        {
            mi = c;
        }

        return mi;
    }

    ///*****************************
    /// Compute Levenshtein distance         
    ///*****************************
    public int LD(String sNew, String sOld)
    {
        int[,] matrix;              // matrix
        int sNewLen = sNew.Length;  // length of sNew
        int sOldLen = sOld.Length;  // length of sOld
        int sNewIdx; // iterates through sNew
        int sOldIdx; // iterates through sOld
        char sNew_i; // ith character of sNew
        char sOld_j; // jth character of sOld
        int cost; // cost

        /// Test string length
        if (Math.Max(sNew.Length, sOld.Length) > Math.Pow(2, 31))
            throw (new Exception("\nMaximum string length in Levenshtein.LD is " + Math.Pow(2, 31) + ".\nYours is " + Math.Max(sNew.Length, sOld.Length) + "."));

        // Step 1

        if (sNewLen == 0)
        {
            return sOldLen;
        }

        if (sOldLen == 0)
        {
            return sNewLen;
        }

        matrix = new int[sNewLen + 1, sOldLen + 1];

        // Step 2

        for (sNewIdx = 0; sNewIdx <= sNewLen; sNewIdx++)
        {
            matrix[sNewIdx, 0] = sNewIdx;
        }

        for (sOldIdx = 0; sOldIdx <= sOldLen; sOldIdx++)
        {
            matrix[0, sOldIdx] = sOldIdx;
        }

        // Step 3

        for (sNewIdx = 1; sNewIdx <= sNewLen; sNewIdx++)
        {
            sNew_i = sNew[sNewIdx - 1];

            // Step 4

            for (sOldIdx = 1; sOldIdx <= sOldLen; sOldIdx++)
            {
                sOld_j = sOld[sOldIdx - 1];

                // Step 5

                if (sNew_i == sOld_j)
                {
                    cost = 0;
                }
                else
                {
                    cost = 1;
                }

                // Step 6

                matrix[sNewIdx, sOldIdx] = Minimum(matrix[sNewIdx - 1, sOldIdx] + 1, matrix[sNewIdx, sOldIdx - 1] + 1, matrix[sNewIdx - 1, sOldIdx - 1] + cost);

            }
        }

        // Step 7

        /// Value between 0 - 100
        /// 0==perfect match 100==totaly different
        //System.Console.WriteLine("Dist=" + matrix[sNewLen, sOldLen]);
        int max = System.Math.Max(sNewLen, sOldLen);
        return (100 * matrix[sNewLen, sOldLen]) / max;
    }
}

+1 for quite an elegant implementation. Tell me, I implemented the `SOUNDEX` algorithm from SQL. What's the difference between these two? I can of course read the code, but I wanted to get from you a high level picture because clearly you understand this algorithm and probably understand the one I implemented. Does the `SOUNDEX` routine for example have some holes in it **that you know of?** — Mike Perrenoud, Jul 19 '13 at 15:20
Thanks, but the input word must be upper case? if I change the input word to lower case the error is "Sequence contains no elements". — MMakati, Jul 19 '13 at 15:22
@MMakati: So what's the problem with `input.ToUpperInvariant()` at the beginning? I must admit that i have copied this from my code where the case didn't matter. I don't know if it matters for you. @Michael: Actually this implementation is from [**here**](http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm). So i'm not a levenshtein-algorithm expert. — Tim Schmelter, Jul 19 '13 at 15:24
I used the toUpper() and it works, I am thinking if the case affects the results. but anyways this answer are helpful and working. Thank you @TimSchmelter. — MMakati, Jul 19 '13 at 15:29
@MMakati: Of course the case affects the result, that's why i've compensated this effect with `ToUpper`. — Tim Schmelter, Jul 19 '13 at 15:35

score 0 · Answer 2 · answered Jul 19 '13 at 14:55

0

Not a complete solution, just a hopefully helpful suggestion...

It seems to me that people are unlikely to use simplifications that are as long as the correct word, so you could at least filter out all results whose length <= the input's length.

answered Jul 19 '13 at 14:55

Martin

5,392
30
39

Mike Perrenoud · Answer 3 · 2013-07-19T15:43:11.300

You really need to implement the SOUNDEX routine that exists in SQL. I've done that in the following code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Soundex
{
    class Program
    {
        static char[] ignoreChars = new char[] { 'a', 'e', 'h', 'i', 'o', 'u', 'w', 'y' };
        static Dictionary<char, int> charVals = new Dictionary<char, int>()
        {
            {'b',1},
            {'f',1},
            {'p',1},
            {'v',1},
            {'c',2},
            {'g',2},
            {'j',2},
            {'k',2},
            {'q',2},
            {'s',2},
            {'x',2},
            {'z',2},
            {'d',3},
            {'t',3},
            {'l',4},
            {'m',5},
            {'n',5},
            {'r',6}
        };

        static void Main(string[] args)
        {
            Console.WriteLine(Soundex("txt"));
            Console.WriteLine(Soundex("text"));
            Console.WriteLine(Soundex("ext"));
            Console.WriteLine(Soundex("tit"));
            Console.WriteLine(Soundex("Cammmppppbbbeeelll"));
        }

        static string Soundex(string s)
        {
            s = s.ToLower();
            StringBuilder sb = new StringBuilder();
            sb.Append(s.First());
            foreach (var c in s.Substring(1))
            {
                if (ignoreChars.Contains(c)) { continue; }

                // if the previous character yields the same integer then skip it
                if ((int)char.GetNumericValue(sb[sb.Length - 1]) == charVals[c]) { continue; }

                sb.Append(charVals[c]);
            }
            return string.Join("", sb.ToString().Take(4)).PadRight(4, '0');
        }
    }
}

See, with this code, the only match out of the examples you gave would be text. Run the console application and you'll see the output (i.e. txt would match text).

score 0 · Answer 4 · edited Feb 26 '20 at 15:34

0

One method I think programs like word uses to correct spellings, is to use NLP (Natural Language Processing) techniques to get the order of Nouns/Adjectives used in the context of the spelling mistakes.. then comparing that to known sentence structures they can estimate 70% chance the spelling mistake was a noun and use that information to filter the corrected spellings.

SharpNLP looks like a good library but I haven't had a chance to fiddle with it yet. To build a library of known sentence structures BTW, in uni we applied our algorithms to public domain books.

check out sams simMetrics library I found on SO (download here, docs here) for loads more options for algorithms to use besides Levenshtein distance.

edited Feb 26 '20 at 15:34

Tangurena

2,121
1
22
41

answered Jul 19 '13 at 15:24

Dead.Rabit

1,965
1
28
46

I must try this one, the program that I am making is really for my thesis (emotion on twitter). I am trying to check if the tweets have shortcut words. thanks. – MMakati Jul 19 '13 at 15:31
Sorry if my post is confusing, it helps me to think of this as a spelling correction system.. Just a thought, but is there any way you could parse these wiki pages to build a more comprehensive/specialised dictionary for your purpose: http://en.wikipedia.org/wiki/List_of_acronyms:_A – Dead.Rabit Jul 23 '13 at 09:04

score -1 · Answer 5 · answered Jul 19 '13 at 15:03

Expanding on my comment, you could use regex to search for a result that is an 'expansion' of the input. Something like this:

private int stringSimilarity(string input, string result)
{
    string regexPattern = ""
    foreach (char c in input)
        regexPattern += input + ".*"

    Match match = Regex.Match(result, regexPattern,
    RegexOptions.IgnoreCase);

    if (match.Success)
        return 1;
    else
        return 0;
}

Ignore the 1 and the 0 - I don't know how similarity valuing works.

How to measure similarity of 2 strings aside from computing distance

5 Answers5