4

I don’t believe the standard library provides anything to compute the distance between two strings and I can’t seem to find anything in Boost StringAlgo. So, is there any other library I could use?

I am not too picky about the algorithm. Jaro-Winkler is fine, Levenshtein too, and I am open to suggestions, I don’t want to code something that someone has already coded..

Paul Lammertsma
  • 37,593
  • 16
  • 136
  • 187
qdii
  • 12,505
  • 10
  • 59
  • 116
  • 7
    What do you mean by "distance between two strings"? – Graham Borland Feb 28 '13 at 13:15
  • 1
    What about the Hamming distance? This one is easy to code. – John Dvorak Feb 28 '13 at 13:16
  • 2
    OK, so not just how far apart they are in memory, then. :) – Graham Borland Feb 28 '13 at 13:16
  • Jaro-Winkler, Levenshtein, anything that offers me a scale from "very close" to "extremely different". – qdii Feb 28 '13 at 13:16
  • I agree that this question is lacking in specifics – johnbakers Feb 28 '13 at 13:17
  • 1
    This one looks good: http://rosettacode.org/wiki/Levenshtein_distance#C.2B.2B (#3 hit on Google for "c++ levenshtein distance") – John Dvorak Feb 28 '13 at 13:18
  • @SebbyJohanns true. I would not know how to define them though. Ideally I would like to test different metrics and see which one fits my problem best. – qdii Feb 28 '13 at 13:19
  • 2
    copy paste one of these http://rosettacode.org/wiki/Levenshtein_distance ? –  Feb 28 '13 at 13:20
  • 1
    I will do that. I secretely hoped that some dusty corner of Boost had that. I don’t like using codes from random sites. – qdii Feb 28 '13 at 13:21
  • Maybe I am mistaken, but I really thought any application wanting to do something like “Cannot find 'bla' did you mean 'blu'” would use something like that, so I am surprised it's not standard yet. – qdii Feb 28 '13 at 13:24
  • Also #1 Google hit looks good to me: http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#C.2B.2B – Christian Ammer Feb 28 '13 at 13:26

3 Answers3

8

You don't define your question with an actual distance metric, so I presume it just has to satisfy the conditions in "Metric (mathematics)":

A metric on a set X is a function (called the distance function or simply distance) d : X × X → R (where R is the set of real numbers). For all x, y, z in X, this function is required to satisfy the following conditions:

  • d(x, y) ≥ 0 (non-negativity, or separation axiom)
  • d(x, y) = 0 if and only if x = y (identity of indiscernibles, or coincidence axiom)
  • d(x, y) = d(y, x) (symmetry)
  • d(x, z) ≤ d(x, y) + d(y, z) (subadditivity / triangle inequality).

Suppose we define d as such:

          { 0 if x = y
d(x, y) = {
          { 1 otherwise

So the first three conditions are satisfied:

  • d(x, y) ≥ 0
  • d(x, y) = 0 iff x = y
  • d(x, y) = d(y, x) = 0 for x = y, and d(x, y) = d(y, x) = 1 for x ≠ y

For the last condition, there are two cases:

  • d(x, z) = 0. The only conceivable values for the right-hand side are 0, 1, and 2, any of which would satisfy the condition.
  • d(x, z) = 1. Suppose that the right-hand side isn't greater than or equal to one. This means that it must be zero. Then both terms on the right hand side would each have to be 0, which means that x = y and y = z. The second condition means that x = z, which in turn means that d(x, z) = 0. This is a contradiction, so the right-hand side must be greater than or equal to one.

Then we can define the metric as:

int d(std::string x, std::string y) {
    if (x == y) {
        return 0;
    } else {
        return 1;
    }
}
Waleed Khan
  • 11,426
  • 6
  • 39
  • 70
6

You could try SimString.

SimString is a simple library for fast approximate string retrieval. Approximate string retrieval finds strings in a database whose similarity with a query string is no smaller than a threshold. Finding not only identical but similar strings, approximate string retrieval has various applications including spelling correction, flexible dictionary matching, duplicate detection, and record linkage.

SimString supports cosine, Jaccard, dice, and overlap coefficients as similarity measures. SimString uses letter n-grams as features for computing string similarity.

Or the SimMetric library.

SimMetrics is a Similarity Metric Library, e.g. from edit distance's (Levenshtein, Gotoh, Jaro etc) to other metrics, (e.g Soundex, Chapman). Work provided by UK Sheffield University funded by (AKT) an IRC sponsored by EPSRC, grant number GR/N15764/01.

Or the libdistance library, which has implementations of the Levenshtein, Dameru, Needleman-Wunsch, Hamming, Bloom Filter, Jaccard, and Minkowski distances.

Phonetic algorithms may also be of interest.

Richard
  • 56,349
  • 34
  • 180
  • 251
  • See also [this question](http://stackoverflow.com/questions/83777/are-there-any-fuzzy-search-or-string-similarity-functions-libraries-written-for). – Richard Feb 28 '13 at 15:21
  • And [this question](http://stackoverflow.com/questions/907997/string-distance-library). – Richard Feb 28 '13 at 15:25
  • Also checkout https://github.com/Martinsos/edlib for C/C++ implementation of Levenshtein distance! – Martinsos Apr 27 '16 at 20:49
0

This related question contains a code snippet demonstrating the Levenshtein distance. It has also been implemented for MySQL in this C code.

Community
  • 1
  • 1
Paul Lammertsma
  • 37,593
  • 16
  • 136
  • 187