Distance between two strings

Question

I don’t believe the standard library provides anything to compute the distance between two strings and I can’t seem to find anything in Boost StringAlgo. So, is there any other library I could use?

I am not too picky about the algorithm. Jaro-Winkler is fine, Levenshtein too, and I am open to suggestions, I don’t want to code something that someone has already coded..

Jaro-Winkler, Levenshtein, anything that offers me a scale from "very close" to "extremely different". — qdii, Feb 28 '13 at 13:16
This one looks good: http://rosettacode.org/wiki/Levenshtein_distance#C.2B.2B (#3 hit on Google for "c++ levenshtein distance") — John Dvorak, Feb 28 '13 at 13:18
@SebbyJohanns true. I would not know how to define them though. Ideally I would like to test different metrics and see which one fits my problem best. — qdii, Feb 28 '13 at 13:19
copy paste one of these http://rosettacode.org/wiki/Levenshtein_distance ? — , Feb 28 '13 at 13:20
I will do that. I secretely hoped that some dusty corner of Boost had that. I don’t like using codes from random sites. — qdii, Feb 28 '13 at 13:21
Maybe I am mistaken, but I really thought any application wanting to do something like “Cannot find 'bla' did you mean 'blu'” would use something like that, so I am surprised it's not standard yet. — qdii, Feb 28 '13 at 13:24
Also #1 Google hit looks good to me: http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#C.2B.2B — Christian Ammer, Feb 28 '13 at 13:26

Waleed Khan · Answer 1 · 2013-11-26T21:29:04.413

You don't define your question with an actual distance metric, so I presume it just has to satisfy the conditions in "Metric (mathematics)":

A metric on a set X is a function (called the distance function or simply distance) d : X × X → R (where R is the set of real numbers). For all x, y, z in X, this function is required to satisfy the following conditions:

d(x, y) ≥ 0 (non-negativity, or separation axiom)

d(x, y) = 0 if and only if x = y (identity of indiscernibles, or coincidence axiom)

d(x, y) = d(y, x) (symmetry)

d(x, z) ≤ d(x, y) + d(y, z) (subadditivity / triangle inequality).

Suppose we define d as such:

          { 0 if x = y
d(x, y) = {
          { 1 otherwise

So the first three conditions are satisfied:

d(x, y) ≥ 0
d(x, y) = 0 iff x = y
d(x, y) = d(y, x) = 0 for x = y, and d(x, y) = d(y, x) = 1 for x ≠ y

For the last condition, there are two cases:

d(x, z) = 0. The only conceivable values for the right-hand side are 0, 1, and 2, any of which would satisfy the condition.
d(x, z) = 1. Suppose that the right-hand side isn't greater than or equal to one. This means that it must be zero. Then both terms on the right hand side would each have to be 0, which means that x = y and y = z. The second condition means that x = z, which in turn means that d(x, z) = 0. This is a contradiction, so the right-hand side must be greater than or equal to one.

Then we can define the metric as:

int d(std::string x, std::string y) {
    if (x == y) {
        return 0;
    } else {
        return 1;
    }
}

As a mathematician, I like your answer :) – qdii Feb 28 '13 at 16:38 — qdii, Feb 28 '13 at 16:38

Richard · Answer 2 · 2013-02-28T15:27:05.967

You could try SimString.

SimString is a simple library for fast approximate string retrieval. Approximate string retrieval finds strings in a database whose similarity with a query string is no smaller than a threshold. Finding not only identical but similar strings, approximate string retrieval has various applications including spelling correction, flexible dictionary matching, duplicate detection, and record linkage.

SimString supports cosine, Jaccard, dice, and overlap coefficients as similarity measures. SimString uses letter n-grams as features for computing string similarity.

Or the SimMetric library.

SimMetrics is a Similarity Metric Library, e.g. from edit distance's (Levenshtein, Gotoh, Jaro etc) to other metrics, (e.g Soundex, Chapman). Work provided by UK Sheffield University funded by (AKT) an IRC sponsored by EPSRC, grant number GR/N15764/01.

Or the libdistance library, which has implementations of the Levenshtein, Dameru, Needleman-Wunsch, Hamming, Bloom Filter, Jaccard, and Minkowski distances.

Phonetic algorithms may also be of interest.

See also [this question](http://stackoverflow.com/questions/83777/are-there-any-fuzzy-search-or-string-similarity-functions-libraries-written-for). — Richard, Feb 28 '13 at 15:21
And [this question](http://stackoverflow.com/questions/907997/string-distance-library). — Richard, Feb 28 '13 at 15:25
Also checkout https://github.com/Martinsos/edlib for C/C++ implementation of Levenshtein distance! — Martinsos, Apr 27 '16 at 20:49

score 0 · Answer 3 · edited May 23 '17 at 12:18

0

This related question contains a code snippet demonstrating the Levenshtein distance. It has also been implemented for MySQL in this C code.

edited May 23 '17 at 12:18

Community

1
1

answered Feb 28 '13 at 15:18

Paul Lammertsma

37,593
16
136
187

Distance between two strings

3 Answers3