What string similarity algorithms are there?

Question

I need to compare 2 strings and calculate their similarity, to filter down a list of the most similar strings.

e.g. searching for "dog" would return

dog
doggone
bog
fog
foggy

e.g. searching for "crack" would return

crack
wisecrack
rack
jack
quack

I have come across:

What other string similarity algorithms are there?

Community wiki because there is no right answer to your question — Joe Phillips, Aug 26 '10 at 14:40
possible duplicate of [A better similarity ranking algorithm for variable length strings](http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings) — j_random_hacker, Aug 30 '10 at 06:53
There are *loads* of questions already dealing with *exactly* this topic. Please search before asking a question. — j_random_hacker, Aug 30 '10 at 06:53
@j_random_hacker: Loads of questions, you say? Link me to a question that demonstrates at least one new technique? And I've seen the one you linked already, before posting my question. I'm not trying to do any simple ranking or sorting, but rather a very accurate similarity algorithm that would return the results I displayed had I been searching a dictionary database. — Robin Rodricks, Aug 31 '10 at 15:28
There are many ways to measure similarity, and you don't explain exactly what sort you're looking for, but based on your examples and the fact that you don't like Levenshtein distance I think you're after some sort of approximate substring matching algorithm. Variants of your question are *the* most popular topic under the `algorithm` tag. Here are some links: — j_random_hacker, Aug 31 '10 at 23:50
http://stackoverflow.com/questions/49263/approximate-string-matching-algorithms — j_random_hacker, Aug 31 '10 at 23:51
http://stackoverflow.com/questions/3326200/match-sub-string-within-a-string-with-tolerance-of-1-character-mismatch — j_random_hacker, Aug 31 '10 at 23:51
http://stackoverflow.com/questions/1938678/q-gram-approximate-matching-optimisations — j_random_hacker, Aug 31 '10 at 23:53
http://stackoverflow.com/questions/451884/similar-string-algorithm — j_random_hacker, Aug 31 '10 at 23:55
http://stackoverflow.com/questions/3329297/finding-groups-of-similar-strings-in-a-large-set-of-strings — j_random_hacker, Sep 01 '10 at 00:02

score 33 · Answer 1 · edited Nov 20 '15 at 21:08

33

The Levenshtein distance is the algorithm I would recommend. It calculates the minimum number of operations you must do to change 1 string into another. The fewer changes means the strings are more similar...

edited Nov 20 '15 at 21:08

Chuck

234,037
30
302
389

answered Aug 26 '10 at 14:47

Peter

963
6
10

4

Levenshtein distance and all its permutations (such as Dam-Lev) work horribly, even QuickSilver outdoes it in the most basic of comparisons. See http://stackoverflow.com/questions/3338889/how-to-find-similar-results-and-sort-by-similarity – Robin Rodricks Aug 26 '10 at 14:50
26

@Jenko: You say Levenshtein distance works "horribly", but you don't give any criteria for deciding what's good or bad. Considering Levenshtein distance is pretty much the archetypal "string similarity algorithm", you should make your question more specific. – j_random_hacker Aug 30 '10 at 06:50
@j_random_hacker: Edited your post to show you why. I linked you to the question that contained the same results, why you didn't read that I don't understand. – Robin Rodricks Aug 31 '10 at 15:25
2

@Jenko: (1) It's not my post. (2) "Erratic" is not a meaningful criterion. I get that you're not happy with the results, but you need to explain precisely what types of similarity you're looking for. And BTW, you would normally set an upper bound on the Lev distance so that only answers 1-3 would be returned in your example. – j_random_hacker Aug 31 '10 at 23:44
I'm using levenshtein distance for a project I'm working on right now and it has proven to be less than ideal for this use case. I found that knowing how many letters in a row match is just as important as matching the same letters in the string in the same order which is essentially what edit distance does. – MetaStack May 23 '16 at 22:23

score 20 · Accepted Answer · edited Mar 23 '13 at 14:33

20

It seems you are needing some kind of fuzzy matching. Here is java implementation of some set of similarity metrics http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. Here is more detailed explanation of string metrics http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf it depends on how fuzzy and how fast your implementation must be.

edited Mar 23 '13 at 14:33

Rob W

341,306
83
791
678

answered Aug 26 '10 at 15:24

Mojo Risin

8,136
5
45
58

3

@PascalKlein An archived page is available at Wayback Machine. I've updated the link to http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html – Rob W Mar 23 '13 at 14:33
There is levenshtein and you could try pruning using a similarity score like Wu-Palmer (wup) which uses the esteemed Wordnet. Stanford NLP for java is a goto. There is also pattern,scipy,numpy; gensim for Python. Calculating Levenshtein is best done via the diagonal of a matrix. – Andrew Scott Evans Oct 04 '15 at 18:56

score 9 · Answer 3 · answered Aug 26 '10 at 17:00

If the focus is on performance, I would implement an algorithm based on a trie structure
(works well to find words in a text, or to help correct a word, but in your case you can find quickly all words containing a given word or all but one letter, for instance).

Please follow first the wikipedia link above.Tries is the fastest words sorting method (n words, search s, O(n) to create the trie, O(1) to search s (or if you prefer, if a is the average length, O(an) for the trie and O(s) for the search)).

A fast and easy implementation (to be optimized) of your problem (similar words) consists of

Make the trie with the list of words, having all letters indexed front and back (see example below)
To search s, iterate from s[0] to find the word in the trie, then s[1] etc...
In the trie, if the number of letters found is len(s)-k the word is displayed, where k is the tolerance (1 letter missing, 2...).
The algorithm may be extended to the words in the list (see below)

Example, with the words car, vars.

Building the trie (big letter means a word end here, while another may continue). The > is post-index (go forward) and < is pre-index (go backward). In another example we may have to indicate also the starting letter, it is not presented here for clarity.
The < and > in C++ for instance would be Mystruct *previous,*next, meaning from a > c < r, you can go directly from a to c, and reversely, also from a to R.

  1.  c < a < R
  2.  a > c < R
  3.    > v < r < S
  4.  R > a > c
  5.        > v < S
  6.  v < a < r < S
  7.  S > r > a > v

Looking strictly for car the trie gives you access from 1., and you find car (you would have found also everything starting with car, but also anything with car inside - it is not in the example - but vicar for instance would have been found from c > i > v < a < R).

To search while allowing 1-letter wrong/missing tolerance, you iterate from each letter of s, and, count the number of consecutive - or by skipping 1 letter - letters you get from s in the trie.

looking for car,

c: searching the trie for c < a and c < r (missing letter in s). To accept a wrong letter in a word w, try to jump at each iteration the wrong letter to see if ar is behind, this is O(w). With two letters, O(w²) etc... but another level of index could be added to the trie to take into account the jump over letters - making the trie complex, and greedy regarding memory.
a, then r: same as above, but searching backwards as well

This is just to provide an idea about the principle - the example above may have some glitches (I'll check again tomorrow).

Gumbo · Answer 4 · 2010-08-26T15:27:43.503

1

You could do this:

Foreach string in haystack Do
    offset := -1;
    matchedCharacters := 0;
    Foreach char in needle Do
        offset := PositionInString(string, char, offset+1);
        If offset = -1 Then
            Break;
        End;
        matchedCharacters := matchedCharacters + 1;
    End;
    If matchedCharacters > 0 Then
       // (partial) match found
    End;
End;

With matchedCharacters you can determine the “degree” of the match. If it is equal to the length of needle, all characters in needle are also in string. If you also store the offset of the first matched character, you can also sort the result by the “density” of the matched characters by subtracting the offset of the first matched character from the offset of the last matched character offset; the lower the difference, the more dense the match.

edited Aug 26 '10 at 15:27

answered Aug 26 '10 at 14:59

Gumbo

643,351
109
780
844

@Jenko: What do you mean? The search is linear, so each string in the list of strings is tested. – Gumbo Aug 26 '10 at 17:14
What does `PositionInString` mean? – Moritz Schmitz v. Hülst Jun 16 '15 at 11:29
@MoritzSchmitzv.Hülst `PositionInString` is meant to be a function, which returns the index position of *char* in *string* starting at *offset*. – Gumbo Jun 16 '15 at 18:56

score 0 · Answer 5 · edited Apr 05 '18 at 14:09

class Program { 
    static int ComputeLevenshteinDistance(string source, string target) {
        if ((source == null) || (target == null)) return 0;
        if ((source.Length == 0) || (target.Length == 0)) return 0;
        if (source == target) return source.Length;

        int sourceWordCount = source.Length;
        int targetWordCount = target.Length;

        int[,] distance = new int[sourceWordCount + 1, targetWordCount + 1];

        // Step 2
        for (int i = 0; i <= sourceWordCount; distance[i, 0] = i++);
        for (int j = 0; j <= targetWordCount; distance[0, j] = j++);

        for (int i = 1; i <= sourceWordCount; i++) {
            for (int j = 1; j <= targetWordCount; j++) {
                // Step 3
                int cost = (target[j - 1] == source[i - 1]) ? 0 : 1;

                // Step 4
                distance[i, j] = Math.Min(Math.Min(distance[i - 1, j] + 1, distance[i, j - 1] + 1), distance[i - 1, j - 1] + cost);
            }
        }

        return distance[sourceWordCount, targetWordCount]; 
    }

    static void Main(string[] args){ 
       Console.WriteLine(ComputeLevenshteinDistance ("Stackoverflow","StuckOverflow"));
       Console.ReadKey();
    }
}

I think he is asking for algorithms not implementation of the solution. — Giulio Caccin, Jul 24 '17 at 21:41

score 0 · Answer 6 · answered Jun 21 '23 at 10:38

I have used Levenshtein distance together with a soundex index, to try and come up with a fuzzy match to misspelled pharmaceutical names that are always quite lengthy and awkward. Soundex is a bit English centric so you would need something else for other languages.

What string similarity algorithms are there?

6 Answers6

Linked