0

I'm looking to assess similarity (including case) between two strings and give a value between 0 and 1.

I tried the Levenshtein distance implementation but it only gives integers and does not compare inner alphabets.

For e.g. comparing "ABCD" and "Abcd" gives distance of 3 and "AOOO" also gives a distance of 3 but clearly "Abcd" is better match than "AOOO".

So compared to "ABCD" I want "ABcd" to be most similar then "Abcd" then "AOOO" then "AOOOO"

I've also looked here but I am not looking for a variable length algorithm.

Thanks

Community
  • 1
  • 1
user972616
  • 1,324
  • 3
  • 18
  • 38
  • 4
    Try Levensthein distance with lower-case only strings perhaps ? – driis Mar 11 '12 at 17:43
  • 1
    So, is AOOOBCD a better match than Abcd? I suspect your notion of "better" won't hold up fire very well. Probably the best you can hope for is to rank them by some arbitray rules (e.g., first by Levenstein, second by number of case changes, etc.) and show a list to a user, whose notion of similar may not match yours. – Ira Baxter Mar 11 '12 at 17:45
  • @IraBaxter Nope "Abcd" will be a better match than "AOOOBCD" – user972616 Mar 11 '12 at 17:48
  • This isn't really a programming (implementation) question. This is a question of design - design of a suitable similarity function. – Bradley Thomas Mar 29 '16 at 19:08

3 Answers3

5

Try something like this

double d = (LevenshteinDist(s, t) + LevenshteinDist(s.ToLower(), t.ToLower())) /
           2.0 * Math.Max(s.Length, t.Length);

If you want to give less importance to case differences than letter differences, you can give different weights to the terms

double d = (0.15*LevenshteinDist(s, t) + 
            0.35*LevenshteinDist(s.ToLower(), t.ToLower())) /
           Math.Max(s.Length, t.Length);

Note that the weights sum up to 0.5, thus makting the division by 2.0 obsolete.

Olivier Jacot-Descombes
  • 104,806
  • 13
  • 138
  • 188
2
    bool check(string[] a, string s)
    {
        for (int i = 0; i < a.Length; i++)
            if (s == a[i])
                return true;
        return false;
    }

    public double simi(string string1, string string2)
    {
        int sub1 = 0;
        int sub2 = 0;
        string[] sp1 = new string[string1.Length - 1];
        string[] sp2 = new string[string2.Length - 1];
        string[] sp3 = new string[string1.Length - 1];
        string[] sp4 = new string[string2.Length - 1];
        for (int i = 0; i < string1.Length - 1; i++)
        {
            string x = "";
            x = string1.Substring(i, 2);

            sp1[sub1] = x;
            ++sub1;
        }
        for (int i = 0; i < string2.Length - 1; i++)
        {
            string x = "";
            x = string2.Substring(i, 2);
            sp2[sub2] = x;
            ++sub2;
        }


        int j = 0, k = 0;

        for (int i = 0; i < sp1.Length; i++)
            if (check(sp3, sp1[i]) == true)
            {

                continue;
            }
            else
            {
                sp3[j] = sp1[i];
                j++;

            }

        for (int i = 0; i < sp2.Length; i++)
            if (check(sp4, sp2[i]) == true)
            {

                continue;
            }
            else
            {
                sp4[k] = sp2[i];
                k++;


            }

        Array.Resize(ref sp3, j);
        Array.Resize(ref sp4, k);

        Array.Sort<string>(sp3);
        Array.Sort<string>(sp4);

        int n = 0;


        for (int i = 0; i < sp3.Length; i++)
        {

            if (check(sp4, sp3[i]))
            {

                n++;
            }


        }

        double resulte;

        int l1 = sp3.Length;
        int l2 = sp4.Length;

        resulte = ((2.0 * Convert.ToDouble(n)) / Convert.ToDouble(l1 + l2)) * 100;

        return resulte;
    }
Safaa
  • 31
  • 1
  • 6
2

Adapt Levenshtein Distance with a custom table T. Let the cost of insertion = 1. The cost of deletion also 1. Let T(c,d) denote the penalty of replacing c with d. T(c,c) should be = 0. T(c,d) should be <= 2.

Define Max(n,m) be the maximum theoretical distance of strings of length n and m. Obviously, Max(n,m) = n+m.

Define Distance(s,t) be the cost of changing s to t divided by Max(s,t). There you go.

Be careful in defining T so that the definition obeys distance axioms:

  • Distance(s,s) = 0
  • Distance(s,t) = Distance(t,s)
  • Distance(s,t) <= Distance(s,u) + Distance(u,t)

Then it will be more useful in more situations.

Ali Ferhat
  • 2,511
  • 17
  • 24