I need to know how far the two strings are similar in terms of some measuring index
Asked
Active
Viewed 323 times
2 Answers
6
Probably the most commonly used metric is something called Levenshtein Distance, sometimes called "Edit Distance". In simple terms, it measures the number of edits (additions, deletions, or in a generalized method, also transpositions) that you need to make the strings in the comparison identical.
The algorithm has simple, efficient, and well-known realizations here's the pseudo code directly from the Wikipedia article linked earlier:
int LevenshteinDistance(char s[1..m], char t[1..n])
{
// for all i and j, d[i,j] will hold the Levenshtein distance between
// the first i characters of s and the first j characters of t;
// note that d has (m+1)x(n+1) values
declare int d[0..m, 0..n]
for i from 0 to m
d[i, 0] := i // the distance of any first string to an empty second string
for j from 0 to n
d[0, j] := j // the distance of any second string to an empty first string
for j from 1 to n
{
for i from 1 to m
{
if s[i] = t[j] then
d[i, j] := d[i-1, j-1] // no operation required
else
d[i, j] := minimum
(
d[i-1, j] + 1, // a deletion
d[i, j-1] + 1, // an insertion
d[i-1, j-1] + 1 // a substitution
)
}
}
return d[m,n]
}
See also this related SO question: Good Python modules for fuzzy string comparison?

Community
- 1
- 1

Mark Elliot
- 75,278
- 22
- 140
- 160