51

I have been looking for an advanced levenshtein distance algorithm, and the best I have found so far is O(n*m) where n and m are the lengths of the two strings. The reason why the algorithm is at this scale is because of space, not time, with the creation of a matrix of the two strings such as this one:

alt text

Is there a publicly-available levenshtein algorithm which is better than O(n*m)? I am not averse to looking at advanced computer science papers & research, but haven't been able to find anything. I have found one company, Exorbyte, which supposedly has built a super-advanced and super-fast Levenshtein algorithm but of course that is a trade secret. I am building an iPhone app which I would like to use the Levenshtein distance calculation. There is an objective-c implementation available, but with the limited amount of memory on iPods and iPhones, I'd like to find a better algorithm if possible.

Cœur
  • 37,241
  • 25
  • 195
  • 267
Jason
  • 14,517
  • 25
  • 92
  • 153

4 Answers4

55

Are you interested in reducing the time complexity or the space complexity ? The average time complexity can be reduced O(n + d^2), where n is the length of the longer string and d is the edit distance. If you are only interested in the edit distance and not interested in reconstructing the edit sequence, you only need to keep the last two rows of the matrix in memory, so that will be order(n).

If you can afford to approximate, there are poly-logarithmic approximations.

For the O(n +d^2) algorithm look for Ukkonen's optimization or its enhancement Enhanced Ukkonen. The best approximation that I know of is this one by Andoni, Krauthgamer, Onak

srean
  • 2,578
  • 18
  • 18
  • 2
    I use this for DNA alignment; We check for the length of the sequences first since the logic for updating the Ukkonen barrier is heavier then just calculating the whole array. Also, take a look at "Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison" for some more details. – nlucaroni Dec 13 '10 at 22:05
  • 3
    The original paper for the Ukkonen Approximate String Matching Algorithm is, http://www.cs.helsinki.fi/u/ukkonen/InfCont85.PDF. – nlucaroni Dec 13 '10 at 22:07
  • 1
    Actually, you don't need the last two rows of the matrix. The last row, plus the previous number in the current row, is enough. Also note that implementing Levenshtein this way is significantly faster than using the full matrix, probably due to CPU caching. – larsga Jan 04 '13 at 11:26
  • An implementation of the Extended Ukkonen algorithm is [here](https://github.com/haifengl/smile/blob/master/math/src/main/java/smile/math/distance/EditDistance.java). It requires a lot of space (O(mn)), but runs quickly when the strings are similar - probably the formula O(n+d^2) is correct for its time complexity. Make sure to call the right constructor (the one with maxStringLength) and reuse the EditDistance instance to get the benefits (otherwise array initialization cost negates them). – Stefan Reich Nov 19 '19 at 23:46
  • @nlucaroni do you compare two strings always or are you trying to find a fuzzy match for a sequence within large database? – Kangur Oct 11 '21 at 09:27
  • The former, this isn't an EST searching solution. – nlucaroni Apr 04 '22 at 20:15
14

If you only want the threshold function - eg, to test if the distance is under a certain threshold - you can reduce the time and space complexity by only calculating the n values either side of the main diagonal in the array. You can also use Levenshtein Automata to evaluate many words against a single base word in O(n) time - and the construction of the automatons can be done in O(m) time, too.

Nick Johnson
  • 100,655
  • 16
  • 128
  • 198
  • And how is it possible that with only one side you know that the other side does not have a path with less distance? That would mean calculating the minimum distance with just one half. And the main diagonal? If strings have different sizes the main diagonal does not even reach the end of the matrix. I think I do not understand what you are saying. – Vixxs Apr 10 '20 at 23:23
  • This blog is really great for understanding the concept http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata – A. Pine Mar 25 '22 at 18:34
3

Look in Wiki - they have some ideas to improve this algorithm to better space complexity:

Wiki-Link: Levenshtein distance

Quoting:

We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.

Dani
  • 14,639
  • 11
  • 62
  • 110
  • One explained in wikipedia for space complexity that is using two rows does not provide correct solution for strings where length(s)>length(t). Lets say to convert S=ab to T=abcd we need two changes. That solution gives 1 as answer. Check it out. – Abhishek Kaushik Oct 13 '14 at 05:07
-1

I found another optimization that claims to be O(max(m, n)):

http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#C

(the second C implementation)

nponeccop
  • 13,527
  • 1
  • 44
  • 106