
I'm looking for an algorithm that takes 2 strings and will give me back a "factor of similarity".

Basically, I will have an input that may be misspelled, have letters transposed, etc., and I have to find the closest match(es) in a list of possible values that I have.

This is not for searching in a database. I'll have an in-memory list of 500 or so strings to match against, all under 30 chars, so it can be relatively slow.

I know this exists, I've seen it before, but I can't remember its name.


Edit: Thanks for pointing out Levenshtein and Hamming. Now, which one should I implement? They basically measure different things, both of which can be used for what I want, but I'm not sure which one is more appropriate.

I've read up on the algorithms; Hamming seems obviously faster. Since neither will detect two characters being transposed (i.e. Jordan and Jodran), which I believe will be a common mistake, which will be more accurate for what I want? Can someone tell me a bit about the trade-offs?

Nick Fortescue
Daniel Magliola
  • Actually, both Hamming and Levenshtein distance detect transpositions, each assigning a cost of 2. This is one of the few typical errors that Hamming distance *will* pick up sensibly -- any single-character insertion or deletion will immediately give you huge dissimilarity scores. Use Levenshtein. – j_random_hacker Feb 24 '09 at 13:06

4 Answers


Ok, so the standard algorithms are:

1) Hamming distance. Only defined for strings of the same length, but very efficient: it simply counts the number of positions at which the two strings differ. Not useful for fuzzy searching of natural language text.
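As a concrete illustration (a minimal Python sketch, not part of the original answer):

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))
```

For example, `hamming("Jordan", "Jodran")` is 2: the transposed pair differs in exactly two positions.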

2) Levenshtein distance. The Levenshtein distance measures distance as the number of "operations" required to transform one string into another, where the operations are insertion, deletion and substitution of a single character. The standard approach to calculating the Levenshtein distance is dynamic programming.
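The dynamic-programming formulation can be sketched in a few lines of Python (a hedged illustration, keeping only two rows of the DP table since each row depends only on the previous one):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn a into b (classic two-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion of ca
                curr[j - 1] + 1,           # insertion of cb
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion). Note that a transposition like "Jordan" vs. "Jodran" costs 2 here, since it takes two substitutions.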

3) Generalized Levenshtein (Damerau–Levenshtein) distance. This distance also takes into consideration transpositions of adjacent characters in a word, and is probably the edit distance best suited for fuzzy matching of manually entered text. The algorithm to compute it is a bit more involved than plain Levenshtein distance (detecting transpositions is not easy). Common implementations are a modification of the bitap algorithm (as used by agrep).
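A sketch of the simpler "optimal string alignment" variant of Damerau–Levenshtein (an illustration assumed here, not code from the answer):

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal string alignment variant: Levenshtein edits plus
    adjacent-character transpositions, each counted as one edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            # adjacent characters swapped?
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

This is exactly where it beats plain Levenshtein for typos: `damerau_levenshtein("Jordan", "Jodran")` is 1 (one transposition) where plain Levenshtein gives 2.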

In general you would probably want an implementation of the third option, wrapped in some sort of nearest-neighbour search. Note that an edit distance is a metric rather than a set of coordinates, so a metric tree such as a BK-tree is a better fit than a k-d tree; that said, with only ~500 strings under 30 characters, a simple linear scan will be plenty fast.

Il-Bhima
  • Levenshtein distance
  • Hamming distance
  • soundex
  • metaphone
vartec
  • Hmmm... ok... which one should I use? If I remember correctly, Soundex is not useful, because it's dependent on the first letter being the same, plus the time I used it (different project), I wasn't very happy about it. What's the tradeoffs between Levenshtein and Hamming, for example? – Daniel Magliola Feb 23 '09 at 12:28
  • Hamming distance can only be used on strings of the same length... Levenshtein distance is like a generalization of Hamming distance – tehvan Feb 23 '09 at 12:34
  • Well, Hamming distance is more for theoretical purposes. If you want to correct or ignore typos -- Levenshtein. If you want to correct or ignore bad spelling -- Metaphone. Note, however, that Levenshtein works in any language; Metaphone is English-only. – vartec Feb 23 '09 at 12:37
  • As far as Soundex is concerned, I don't think it depends on the first letter being the same and you can find localized versions of the algorithm (at least for French). I personally had good results with the French version. – saalaa Feb 23 '09 at 12:49
  • Not to mention that Soundex can be very handy for large data sets, since you get a Soundex representation of your word that never varies. I had over 80000 entries in a database and stored their Soundex representations for fast matching against a needle. I never made any real benchmark though. – saalaa Feb 23 '09 at 12:53
  • Actually it was more around 120000 different word/soundex pairs. Sound matching was bleeding fast, since for a given query I only had to compute a Soundex representation for the needle. – saalaa Feb 23 '09 at 12:56
  • Benoit, yes, I agree. I've used Soundex for the same reasons (I could precompute it for my records, and do a fast SQL query on it). But in this case, I need more accuracy and less speed. – Daniel Magliola Feb 23 '09 at 13:12
  • One plus for Soundex is that MSSQL has built in support for it. – Stefan Feb 23 '09 at 13:14

The Damerau–Levenshtein distance is similar to the Levenshtein distance, but also includes two-character transpositions. The Wikipedia page (linked) includes pseudocode that should be fairly trivial to implement.

Autoplectic

You're looking for the Levenshtein distance.
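For the use case in the question (matching a possibly misspelled input against an in-memory list of ~500 strings), Python's standard library already ships a ready-made fuzzy matcher. It ranks by `difflib.SequenceMatcher`'s similarity ratio rather than Levenshtein distance, so treat it as an alternative rather than the algorithm named above; the candidate list here is made up for illustration:

```python
import difflib

candidates = ["Jordan", "Michael", "Daniel"]  # hypothetical in-memory list

# Returns up to n candidates whose similarity ratio to the input exceeds
# cutoff (0.0-1.0), best match first.
matches = difflib.get_close_matches("Jodran", candidates, n=3, cutoff=0.6)
print(matches)
```

Raising `cutoff` makes matching stricter; with short strings like these, 0.6 is a reasonable starting point.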

tehvan