2

I need to know how far the two strings are similar in terms of some measuring index

joolie
  • 399
  • 1
  • 5
  • 7

2 Answers2

6

Probably the most commonly used metric is something called Levenshtein Distance, sometimes called "Edit Distance". In simple terms, it measures the number of edits (additions, deletions, or in a generalized method, also transpositions) that you need to make the strings in the comparison identical.

The algorithm has simple, efficient, and well-known realizations here's the pseudo code directly from the Wikipedia article linked earlier:

 int LevenshteinDistance(char s[1..m], char t[1..n])
 {
   // for all i and j, d[i,j] will hold the Levenshtein distance between
   // the first i characters of s and the first j characters of t;
   // note that d has (m+1)x(n+1) values
   declare int d[0..m, 0..n]

   for i from 0 to m
     d[i, 0] := i // the distance of any first string to an empty second string
   for j from 0 to n
     d[0, j] := j // the distance of any second string to an empty first string

   for j from 1 to n
   {
     for i from 1 to m
     {
       if s[i] = t[j] then  
         d[i, j] := d[i-1, j-1]       // no operation required
       else
         d[i, j] := minimum
                    (
                      d[i-1, j] + 1,  // a deletion
                      d[i, j-1] + 1,  // an insertion
                      d[i-1, j-1] + 1 // a substitution
                    )
     }
   }

   return d[m,n]
 }

See also this related SO question: Good Python modules for fuzzy string comparison?

Community
  • 1
  • 1
Mark Elliot
  • 75,278
  • 22
  • 140
  • 160
5

Lucky for you, Python comes with the difflib module :)

Check out get_close_matches function

st0le
  • 33,375
  • 8
  • 89
  • 89