Comparing strings with tolerance

Question

I'm looking for a way to compare a string with an array of strings. Doing an exact search is quite easy of course, but I want my program to tolerate spelling mistakes, missing parts of the string and so on.

Is there some kind of framework which can perform such a search? I'm having something in mind that the search algorithm will return a few results order by the percentage of match or something like this.

D'Arcy Rittich · Accepted Answer · 2010-02-26T19:53:07.797

You could use the Levenshtein Distance algorithm.

"The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character." - Wikipedia.com

This one is from dotnetperls.com:

using System;

/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
        Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
        Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
    }
}

You may in fact prefer to use the Damerau-Levenshtein distance algorithm, which also allows characters to be transposed, which is a common human error in data entry. You'll find a C# implementation of it here.

I'd have to argue against Levenshtein Distance in this case. While it's great for figuring out how different two strings are, spelling mistakes more often than not retain the proper phonetics. For example, the LD algorithm will probably *not* indicate that "cool cat" and "kool kat" are similar (which is what I believe the poster wishes) whereas Soundex and Metaphone are more likely to indicate the similarity between those words/phrases. — casperOne, Feb 26 '10 at 20:05
@casperOne: hard to say without knowing the data set it is being applied to, but agree there is no one-size-fits-all approach. I'm a big fan of double metaphone. — D'Arcy Rittich, Feb 26 '10 at 20:12
@RedFilter hi.. i have used levenshtein distance... but i am actually comparing countries or regions of the world. so if i keep tolerance as 2 then Austria and Australia are shown same. at the same time, USA and United States are shown different. what can i do for this problem? — Jay Nirgudkar, Jun 10 '15 at 06:41
@JayNirgudkar In this case I would have additional data for nicknakes/abbreviations that I also compare against. — D'Arcy Rittich, Jun 10 '15 at 16:18

casperOne · Answer 2 · 2021-05-10T12:06:36.427

There is nothing in the .NET framework that will help you with this out-of-the-box.

The most common spelling mistakes are those where the letters are a decent phonetic representation of the word, but not the correct spelling of the word.

For example, it could be argued that the words sword and sord (yes, that's a word) have the same phonetic roots (they sound the same when you pronounce them).

That being said, there are a number of algorithms that you can use to translate words (even mispelled ones) into phonetic variants.

The first is Soundex. It's fairly simple to implement and there are a fair number of .NET implementations of this algorithm. It's rather simple, but it gives you real values you can compare to each other.

Another is Metaphone. While I can't find a native .NET implementation of Metaphone, the link provided has links to a number of other implementations which could be converted. The easiest to convert would probably be the Java implementation of the Metaphone algorithm.

It should be noted that the Metaphone algorithm has gone through revisions. There is Double Metaphone (which has a .NET implementation) and Metaphone 3. Metaphone 3 is a commercial application, but has a 98% accuracy rate compared to an 89% accuracy rate for the Double Metaphone algorithm when run against a database of common English words. Depending on your need, you might want to look for (in the case of Double Metaphone) or purchase (in the case of Metaphone 3) the source for the algorithm and convert or access it through the P/Invoke layer (there are C++ implementations abound).

Metaphone and Soundex differ in the sense that Soundex produces fixed length numeric keys, whereas Metaphone produces keys of different length, so the results will be different. In the end, both will do the same kind of comparison for you, you just have to find out which suits your needs the best, given your requirements and resources (and intolerance levels for the spelling mistakes, of course).

score 17 · Answer 3 · answered Nov 23 '16 at 22:14

Here is an implementation of the LevenshteinDistance method that uses far less memory while producing the same results. This is a C# adaptation of the pseudo code found in this wikipedia article under the "Iterative with two matrix rows" heading.

public static int LevenshteinDistance(string source, string target)
{
    // degenerate cases
    if (source == target) return 0;
    if (source.Length == 0) return target.Length;
    if (target.Length == 0) return source.Length;

    // create two work vectors of integer distances
    int[] v0 = new int[target.Length + 1];
    int[] v1 = new int[target.Length + 1];

    // initialize v0 (the previous row of distances)
    // this row is A[0][i]: edit distance for an empty s
    // the distance is just the number of characters to delete from t
    for (int i = 0; i < v0.Length; i++)
        v0[i] = i;

    for (int i = 0; i < source.Length; i++)
    {
        // calculate v1 (current row distances) from the previous row v0

        // first element of v1 is A[i+1][0]
        //   edit distance is delete (i+1) chars from s to match empty t
        v1[0] = i + 1;

        // use formula to fill in the rest of the row
        for (int j = 0; j < target.Length; j++)
        {
            var cost = (source[i] == target[j]) ? 0 : 1;
            v1[j + 1] = Math.Min(v1[j] + 1, Math.Min(v0[j + 1] + 1, v0[j] + cost));
        }

        // copy v1 (current row) to v0 (previous row) for next iteration
        for (int j = 0; j < v0.Length; j++)
            v0[j] = v1[j];
    }

    return v1[target.Length];
}

Here is a function that will give you the percentage similarity.

/// <summary>
/// Calculate percentage similarity of two strings
/// <param name="source">Source String to Compare with</param>
/// <param name="target">Targeted String to Compare</param>
/// <returns>Return Similarity between two strings from 0 to 1.0</returns>
/// </summary>
public static double CalculateSimilarity(string source, string target)
{
    if ((source == null) || (target == null)) return 0.0;
    if ((source.Length == 0) || (target.Length == 0)) return 0.0;
    if (source == target) return 1.0;

    int stepsToSame = LevenshteinDistance(source, target);
    return (1.0 - ((double)stepsToSame / (double)Math.Max(source.Length, target.Length)));
}

With regards to performance, these methods are great for small strings but not so great for large strings. — Fuzzy Analysis, Jan 05 '23 at 04:00

score 7 · Answer 4 · answered Jan 14 '11 at 05:54

7

Your other option is to compare phonetically using Soundex or Metaphone. I just completed an article that presents C# code for both algorithms. You can view it at http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex.

answered Jan 14 '11 at 05:54

Jonathan Wood

65,341
71
269
466

score 4 · Answer 5 · answered Feb 26 '10 at 19:43

Here are two methods that calculate the Levenshtein Distance between strings.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

Once you have the result, you'll need to define what value you want to use as a threshold for a match or not. Run the function on a bunch of sample data to get a good idea of how it works to help decide on your particular threshold.

    /// <summary>
    /// Calculates the Levenshtein distance between two strings--the number of changes that need to be made for the first string to become the second.
    /// </summary>
    /// <param name="first">The first string, used as a source.</param>
    /// <param name="second">The second string, used as a target.</param>
    /// <returns>The number of changes that need to be made to convert the first string to the second.</returns>
    /// <remarks>
    /// From http://www.merriampark.com/ldcsharp.htm
    /// </remarks>
    public static int LevenshteinDistance(string first, string second)
    {
        if (first == null)
        {
            throw new ArgumentNullException("first");
        }
        if (second == null)
        {
            throw new ArgumentNullException("second");
        }

        int n = first.Length;
        int m = second.Length;
        var d = new int[n + 1, m + 1]; // matrix

        if (n == 0) return m;
        if (m == 0) return n;

        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        for (int i = 1; i <= n; i++)
        {

            for (int j = 1; j <= m; j++)
            {
                int cost = (second.Substring(j - 1, 1) == first.Substring(i - 1, 1) ? 0 : 1); // cost
                d[i, j] = Math.Min(
                    Math.Min(
                        d[i - 1, j] + 1,
                        d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }

        return d[n, m];
    }

score 0 · Answer 6 · answered Dec 31 '22 at 07:18

using System;
 
public class Example
{
    public static int getEditDistance(string X, string Y)
    {
        int m = X.Length;
        int n = Y.Length;
 
        int[][] T = new int[m + 1][];
        for (int i = 0; i < m + 1; ++i) {
            T[i] = new int[n + 1];
        }
 
        for (int i = 1; i <= m; i++) {
            T[i][0] = i;
        }
        for (int j = 1; j <= n; j++) {
            T[0][j] = j;
        }
 
        int cost;
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                cost = X[i - 1] == Y[j - 1] ? 0: 1;
                T[i][j] = Math.Min(Math.Min(T[i - 1][j] + 1, T[i][j - 1] + 1),
                        T[i - 1][j - 1] + cost);
            }
        }
 
        return T[m][n];
    }
 
    public static double findSimilarity(string x, string y) {
        if (x == null || y == null) {
            throw new ArgumentException("Strings must not be null");
        }
 
        double maxLength = Math.Max(x.Length, y.Length);
        if (maxLength > 0) {
            // optionally ignore case if needed
            return (maxLength - getEditDistance(x, y)) / maxLength;
        }
        return 1.0;
    }
 
 
    public static void Main()
    {
        string s1 = "Techie Delight";
        string s2 = "Tech Delight";
 
        double similarity = findSimilarity(s1, s2) * 100;
 
        Console.WriteLine(similarity);        // 85.71428571428571
    }
}

"Code Only" answers don't really help anyone learn anything. How is your answer any better than the existing ones? — Andy, Dec 31 '22 at 07:23

Comparing strings with tolerance

6 Answers6

Linked

Related