31

I just implemented a best match file search algorithm to find the closest match to a string in a dictionary. After profiling my code, I found out that the overwhelming majority of time is spent calculating the distance between the query and the possible results. I am currently implementing the algorithm to calculate the Levenshtein Distance using a 2-D array, which makes the implementation an O(n^2) operation. I was hoping someone could suggest a faster way of doing the same.

Here's my implementation:

public int calculate(String root, String query)
{
  int arr[][] = new int[root.length() + 2][query.length() + 2];

  for (int i = 2; i < root.length() + 2; i++)
  {
    arr[i][0] = (int) root.charAt(i - 2);
    arr[i][1] = (i - 1);
  }

  for (int i = 2; i < query.length() + 2; i++)
  {
    arr[0][i] = (int) query.charAt(i - 2);
    arr[1][i] = (i - 1);
  }

  for (int i = 2; i < root.length() + 2; i++)
  {
    for (int j = 2; j < query.length() + 2; j++)
    {
      int diff = 0;
      if (arr[0][j] != arr[i][0])
      {
        diff = 1;
      }
      arr[i][j] = min((arr[i - 1][j] + 1), (arr[i][j - 1] + 1), (arr[i - 1][j - 1] + diff));
    }
  }
  return arr[root.length() + 1][query.length() + 1];
}

public int min(int n1, int n2, int n3)
{
  return (int) Math.min(n1, Math.min(n2, n3));
}
The Guy with The Hat
  • 10,836
  • 8
  • 57
  • 75
efficiencyIsBliss
  • 3,043
  • 7
  • 38
  • 44
  • yes, the other things are efficient enough. I profiled my code and found that the bottleneck was calculating the levenshtein distance which is why I'm trying to optimize that bit right now. I'm implementing the improvement mentioned in the wikipedia article and I'll follow it up with an implementation of the VP-tree to see which one is more efficient. – efficiencyIsBliss Jul 06 '10 at 13:56
  • 4
    about "using a 2-D array, which makes the implementation an O(n^2) operation": calculating a Levenshtein distance between two sequences with no constraints is already an O(n^2) operation irrespective of how much memory you use -- using a 2-D array just slows you down and wastes memory; only O(n) memory is required. – John Machin Jul 07 '10 at 00:58
  • @John Machin I know this is ancient but could you provide an example or some link to how that O(n) space solution would be implemented? – celavek Dec 14 '18 at 08:19

6 Answers6

28

The wikipedia entry on Levenshtein distance has useful suggestions for optimizing the computation -- the most applicable one in your case is that if you can put a bound k on the maximum distance of interest (anything beyond that might as well be infinity!) you can reduce the computation to O(n times k) instead of O(n squared) (basically by giving up as soon as the minimum possible distance becomes > k).

Since you're looking for the closest match, you can progressively decrease k to the distance of the best match found so far -- this won't affect the worst case behavior (as the matches might be in decreasing order of distance, meaning you'll never bail out any sooner) but average case should improve.

I believe that, if you need to get substantially better performance, you may have to accept some strong compromise that computes a more approximate distance (and so gets "a reasonably good match" rather than necessarily the optimal one).

Alex Martelli
  • 854,459
  • 170
  • 1,222
  • 1,395
  • 1
    I just found the same and was about to post it here. Thanks! – efficiencyIsBliss Jul 06 '10 at 02:57
  • You can always create with a small set of matches using whatever "approximate distance" metric you come up with, then use real Levenshtein to rank those. – kibibu Jul 06 '10 at 04:06
  • I read up on the optimization that the above post talks of, but I can't honestly say I get it. I'm not sure how to implement it either. Can someone offer some help? – efficiencyIsBliss Jul 07 '10 at 02:29
  • @Dharmesh, not in comments -- way too cramped -- but a separate question on _finding closest neighbor_ (the main optimization you can perform being to _not_ fully compute the distance in most cases -- and that optimization is only possible because you want the nearest neighbor, not for all distance computation tasks!) ideally with some samples of typical/interesting roots and sets of queries, should elicit lots of help (be sure to tag it as "algorithm" too!). – Alex Martelli Jul 07 '10 at 03:35
  • @Alex: that optimisation is also possible if the OP wanted to find all "likely" matches, "likely" being defined as e.g. `distance(query,candidate)/max(len(query),len(candidate)) < some_threshold_fraction` – John Machin Jul 07 '10 at 07:31
  • @Alex: I realized that the trick is to not compute the entire distance. I just don't fully understand what part of the entire distance I should compute and that is what I wanted help with. I'll make the new question with the name you suggested. If you could offer some more help there, I'd really appreciate it. – efficiencyIsBliss Jul 07 '10 at 13:10
  • @John: I don't think I need all the likely matches. I just need the distance of the closest match. – efficiencyIsBliss Jul 07 '10 at 13:11
  • I posted the new question under the name "Finding closest neighbour using optimized Levenshtein Algorithm". Here's the link: http://stackoverflow.com/questions/3195269/finding-closest-neighbour-using-optimized-levenshtein-algorithm – efficiencyIsBliss Jul 07 '10 at 13:42
  • @Dharmesh, I see you're getting excellent answers on the new question so hopefully your doubts about what to compute and how are now being quelled without the need for further input from me. – Alex Martelli Jul 07 '10 at 18:31
  • @Alex: Yeah, I got a lot of good help there. Some more wouldn't hurt though, if you have the time. – efficiencyIsBliss Jul 07 '10 at 19:47
  • Thanks to your idea I just contributed to apache-commons-lang, how weird that they didn't do this already? https://github.com/apache/commons-lang/pull/118 – jontejj Nov 26 '15 at 10:04
  • Hello :), I have exactly posted the question about how to put a bound at the Levenstein distance here (https://stackoverflow.com/questions/59686989/break-for-greater-than-n-levenshtein-distance) and nobody has answered yet. Do you have any code which does this? – Outcast Feb 05 '20 at 18:11
8

According to a comment on this blog, Speeding Up Levenshtein, you can use VP-Trees and achieve O(nlogn). Another comment on the same blog points to a python implementation of VP-Trees and Levenshtein. Please let us know if this works.

beapea
  • 5
  • 4
Andrew B.
  • 1,010
  • 6
  • 19
  • 10
    I realize this is an old thread, but you got me confused for a minute. You're not talking about the same `n` ! In the OP `n` is the word length and we're interested in the time to compute distance between 2 words. While for you `n` is the number of words in the dictionary, and `n log n` is the number of times you call the distance function. If you combine both it's `W k D log D` with `D` dict size, `W` word size, `k` distance threshold. – Antoine Apr 09 '15 at 15:58
  • As of 2023-01-01, the link "Speeding Up Levenshtein" is broken, and I don't see a good fix for it. The python code linked to shows an example of using VP-Trees to the spell check problem. It uses Levenshtein distance, as a way of applying a function to the nodes in the VP-Tree. The Levenshtein distance is still `O(m*n)` where m and n are the sizes of the words being compared. – hlongmore Jan 01 '23 at 09:54
3

I modified the Levenshtein distance VBA function found on this post to use a one dimensional array. It performs much faster.

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)

Public Function LevenshteinDistance2(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long 'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long, ssL As Long, cost As Long 'loop counters, loop step, loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long 'Length of S1 + 1, Length of S1 + 2

L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1

For i = 0 To L1 Step 1: D(i) = i: Next i                'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j        'setup array positions 0,1,2,3,4,...

For j = 1 To L2
    ssL = (L1 + 1) * j
    For i = (ssL + 1) To (ssL + L1)
        If Mid$(s1, i Mod ssL, 1) <> Mid$(s2, j, 1) Then cost = 1 Else cost = 0
        cI = D(i - 1) + 1
        cD = D(i - L1p1) + 1
        cS = D(i - L1p2) + cost

        If cI <= cD Then 'Insertion or Substitution
            If cI <= cS Then D(i) = cI Else D(i) = cS
        Else 'Deletion or Substitution
            If cD <= cS Then D(i) = cD Else D(i) = cS
        End If
    Next i
Next j

LevenshteinDistance2 = D(LD)
End Function

I have tested this function with string 's1' of length 11,304 and 's2' of length 5,665 ( > 64 million character comparisons). With the above single dimension version of the function, the execution time is ~24 seconds on my machine. The original two dimensional function that I referenced in the link above requires ~37 seconds for the same strings. I have optimized the single dimensional function further as shown below and it requires ~10 seconds for the same strings.

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long         'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long                       'loop counters, loop step
Dim ssL As Long, cost As Long                               'loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long                      'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long                              'Length of S1 + 1, Length of S1 + 2
Dim sss1() As String, sss2() As String                      'Character arrays for string S1 & S2

L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1

For i = 0 To L1 Step 1: D(i) = i: Next i                    'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j            'setup array positions 0,1,2,3,4,...

ReDim sss1(1 To L1)                                         'Size character array S1
ReDim sss2(1 To L2)                                         'Size character array S2
For i = 1 To L1 Step 1: sss1(i) = Mid$(s1, i, 1): Next i    'Fill S1 character array
For i = 1 To L2 Step 1: sss2(i) = Mid$(s2, i, 1): Next i    'Fill S2 character array

For j = 1 To L2
    ssL = (L1 + 1) * j
    For i = (ssL + 1) To (ssL + L1)
        If sss1(i Mod ssL) <> sss2(j) Then cost = 1 Else cost = 0
        cI = D(i - 1) + 1
        cD = D(i - L1p1) + 1
        cS = D(i - L1p2) + cost
        If cI <= cD Then 'Insertion or Substitution
            If cI <= cS Then D(i) = cI Else D(i) = cS
        Else 'Deletion or Substitution
            If cD <= cS Then D(i) = cD Else D(i) = cS
        End If
    Next i
Next j

LevenshteinDistance = D(LD)
End Function
Community
  • 1
  • 1
3

The Wikipedia article discusses your algorithm, and various improvements. However, it appears that at least in the general case, O(n^2) is the best you can get.

There are however some improvements if you can restrict your problem (e.g. if you are only interested in the distance if it's smaller than d, complexity is O(dn) - this might make sense as a match whose distance is close to the string length is probably not very interesting ). See if you can exploit the specifics of your problem...

sleske
  • 81,358
  • 34
  • 189
  • 227
3

Commons-lang has a pretty fast implementation. See http://web.archive.org/web/20120526085419/http://www.merriampark.com/ldjava.htm.

Here's my translation of that into Scala:

// The code below is based on code from the Apache Commons lang project.
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with this
 * work for additional information regarding copyright ownership. The ASF
 * licenses this file to You under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance with the
 * License. You may obtain a copy of the License at
 * 
 * http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations
 * under the License.
 */
/**
* assert(levenshtein("algorithm", "altruistic")==6)
* assert(levenshtein("1638452297", "444488444")==9)
* assert(levenshtein("", "") == 0)
* assert(levenshtein("", "a") == 1)
* assert(levenshtein("aaapppp", "") == 7)
* assert(levenshtein("frog", "fog") == 1)
* assert(levenshtein("fly", "ant") == 3)
* assert(levenshtein("elephant", "hippo") == 7)
* assert(levenshtein("hippo", "elephant") == 7)
* assert(levenshtein("hippo", "zzzzzzzz") == 8)
* assert(levenshtein("hello", "hallo") == 1)
*
*/
def levenshtein(s: CharSequence, t: CharSequence, max: Int = Int.MaxValue) = {
import scala.annotation.tailrec
def impl(s: CharSequence, t: CharSequence, n: Int, m: Int) = {
  // Inside impl n <= m!
  val p = new Array[Int](n + 1) // 'previous' cost array, horizontally
  val d = new Array[Int](n + 1) // cost array, horizontally

  @tailrec def fillP(i: Int) {
    p(i) = i
    if (i < n) fillP(i + 1)
  }
  fillP(0)

  @tailrec def eachJ(j: Int, t_j: Char, d: Array[Int], p: Array[Int]): Int = {
    d(0) = j
    @tailrec def eachI(i: Int) {
      val a = d(i - 1) + 1
      val b = p(i) + 1
      d(i) = if (a < b) a else {
        val c = if (s.charAt(i - 1) == t_j) p(i - 1) else p(i - 1) + 1
        if (b < c) b else c
      }
      if (i < n)
        eachI(i + 1)
    }
    eachI(1)

    if (j < m)
      eachJ(j + 1, t.charAt(j), p, d)
    else
      d(n)
  }
  eachJ(1, t.charAt(0), d, p)
}

val n = s.length
val m = t.length
if (n == 0) m else if (m == 0) n else {
  if (n > m) impl(t, s, m, n) else impl(s, t, n, m)
}

}

suriv
  • 757
  • 8
  • 13
nafg
  • 2,424
  • 27
  • 25
2

I know this is very late but it is relevant to the discussion at hand.

As mentioned by others, if all you want to do is check whether the edit distance between two strings is within some threshold k, you can reduce the time complexity to O(kn). A more precise expression would be O((2k+1)n). You take a strip which spans k cells either side of the diagonal cell (length of strip 2k+1) and compute the values of cells lying on this strip.

Interestingly, there's been an improvement by Li et. al. and this has been further reduced to O((k+1)n).