How to optimize levenshtein distance for checking for a distance of 1?

Question

I'm working on a game where I only need to check if there's a distance of 0 or 1 between two words and return true if that's the case. I found a general purpose levenshtein distance algorithm:

function levenshtein(s, t) {
  if (s === t) { return 0; }
  var n = s.length, m = t.length;
  if (n === 0 || m === 0) { return n + m; }
  var x = 0, y, a, b, c, d, g, h, k;
  var p = new Array(n);
  for (y = 0; y < n;) { p[y] = ++y; }
  for (;
    (x + 3) < m; x += 4) {
    var e1 = t.charCodeAt(x);
    var e2 = t.charCodeAt(x + 1);
    var e3 = t.charCodeAt(x + 2);
    var e4 = t.charCodeAt(x + 3);
    c = x; b = x + 1; d = x + 2; g = x + 3; h = x + 4;

    for (y = 0; y < n; y++) {
      k = s.charCodeAt(y);
      a = p[y];

      if (a < c || b < c) { c = (a > b ? b + 1 : a + 1); }
      else { if (e1 !== k) { c++; } }

      if (c < b || d < b) { b = (c > d ? d + 1 : c + 1); } 
      else { if (e2 !== k) { b++; } }

      if (b < d || g < d) { d = (b > g ? g + 1 : b + 1); } 
      else { if (e3 !== k) { d++; } }

      if (d < g || h < g) { g = (d > h ? h + 1 : d + 1); }
      else { if (e4 !== k) { g++; } }

      p[y] = h = g; g = d; d = b; b = c; c = a;
    }
  }

  for (; x < m;) {
    var e = t.charCodeAt(x);
    c = x;
    d = ++x;
    for (y = 0; y < n; y++) {
      a = p[y];
      if (a < c || d < c) { d = (a > d ? d + 1 : a + 1); } 
      else {
        if (e !== s.charCodeAt(y)) { d = c + 1; }
        else { d = c; }
      }
      p[y] = d;
      c = a;
    }
    h = d;
  }

  return h;
}

Which works, but this spot is going to be a hotspot and be run potentially hundreds of thousands of times a second and I want to optimize it because I don't need a general purpose algorithm, just one that checks if there's a distance of 0 or 1.

I tried writing it and came up with this:

function closeGuess(guess, word) {
  if (Math.abs(word.length - guess.length) > 1) { return false; }

  var errors = 0, guessIndex = 0, wordIndex = 0;

  while (guessIndex < guess.length || wordIndex < word.length) {
    if (errors > 1) { return false; }
    if (guess[guessIndex] !== word[wordIndex]) {
      if (guess.length < word.length) { wordIndex++; }
      else { guessIndex++; }
      errors++;
    } else {
      wordIndex++;
      guessIndex++;
    }
  }

  return true;
}

But after profiling it I found that my code was twice as slow, which surprised me because I think the general purpose algorithm is O(n*m) and I think mine is O(n).

I've been testing the performance difference on this fiddle: https://jsfiddle.net/aubtze2L/3/

Are there any better algorithms I can use or any way I can optimize my code to be faster?

le_m · Accepted Answer · 2016-06-24T15:39:42.360

I don't see a more elegant way which is at the same time faster than the good old for-loop:

function lev01(a, b) {
  let la = a.length;
  let lb = b.length;
  let d = 0;
  switch (la - lb) {
    case 0: // mutation
      for (let i = 0; i < la; ++i) {
        if (a.charAt(i) != b.charAt(i) && ++d > 1) {
          return false;
        }
      }
      return true;
    case -1: // insertion
      for (let i = 0; i < la + d; ++i) {
        if (a.charAt(i - d) != b.charAt(i) && ++d > 1) {
          return false;
        }
      }
      return true;
    case +1: // deletion
      for (let i = 0; i < lb + d; ++i) {
        if (a.charAt(i) != b.charAt(i - d) && ++d > 1) {
          return false;
        }
      }
      return true;
  }
  return false;
}

console.log(lev01("abc", "abc"));
console.log(lev01("abc", "abd"));
console.log(lev01("abc", "ab"));
console.log(lev01("abc", "abcd"));
console.log(lev01("abc", "cba"));

Performance comparison (Chrome):

80.33ms - lev01 (this answer)
234.84ms - lev
708.12ms - close

This is the fastest when you replace all `let`s with `var`. Changes the speed on my machine from 160ms to 120ms. Thanks. — Lawrence Douglas, Jun 26 '16 at 07:31

Dylon · Answer 2 · 2016-06-24T01:22:41.103

Consider the following cases:

If the difference in lengths of the terms is greater than 1, then the Levenshtein distance between them will be greater than 1.
If the difference in lengths is exactly 1, then the shortest string must be equal to the longest string, with a single deletion (or insertion).
If the strings are the same length then you should consider a modified version of Hamming distance which returns false if two, different characters are found:

Here is a sample implementation:

var areSimilar;

areSimilar = function(guess, word) {
  var charIndex, foundDiff, guessLength, lengthDiff, substring, wordLength, shortest, longest, shortestLength, offset;

  guessLength = guess.length;
  wordLength = word.length;

  lengthDiff = guessLength - wordLength;
  if (lengthDiff < -1 || lengthDiff > 1) {
    return false;
  }

  if (lengthDiff !== 0) {
    if (guessLength < wordLength) {
      shortest = guess;
      longest = word;
      shortestLength = guessLength;
    } else {
        shortest = word;
      longest = guess;
      shortestLength = wordLength;
    }

    offset = 0;
    for (charIndex = 0; charIndex < shortestLength; charIndex += 1) {
        if (shortest[charIndex] !== longest[offset + charIndex]) {
        if (offset > 0) {
            return false; // second error
        }
        offset = 1;
        if (shortest[charIndex] !== longest[offset + charIndex]) {
            return false; // second error
        }
      }
    }

    return true; // only one error
  }

  foundDiff = false;
  for (charIndex = 0; charIndex < guessLength; charIndex += 1) {
    if (guess[charIndex] !== word[charIndex]) {
      if (foundDiff) {
        return false;
      }
      foundDiff = true;
    }
  }
  return true;
};

I've updated your fiddle to include this method. Here are the results on my machine:

close: 154.61
lev: 176.72500000000002
sim: 32.48000000000013

Fiddle: https://jsfiddle.net/dylon/aubtze2L/11/

*"If the difference in lengths is exactly 1, then the shortest string must be equal to the prefix of the longest string, of the same length as the shortest string."* - what about "aa" and "baa"? "aa" is shorter but not a prefix of "baa"? — le_m, Jun 24 '16 at 00:03
Same with 'aaaa' and 'aabaa' - neither post- nor prefix. Hm, perhaps another rule needs to be found. — le_m, Jun 24 '16 at 00:33
You're right, I'm moving too quickly. I've never optimized for this use case, sorry. I'll think it through and update my solution again. — Dylon, Jun 24 '16 at 01:12

score 0 · Answer 3 · edited May 23 '17 at 10:29

0

If you know that you are looking for distance 0 and 1, then the general purpose DP algorithm does not make sense (and by the way the algorithm you showed looks convoluted, take a look at a better explanation here).

To check that the distance is 0, all you need is to check whether 2 strings are the same. Now if the distance is one, it means that either insertion, deletion or substitution should have happened. So generate all possible deletion from the original string and check whether it is equal to second string. So you will get something like this:

for (var i = 0; i < s_1.length; i++) {
  if s_2 == s_1.slice(0, i) + s_1.slice(i + 1) {
     return true
  }
}

For insertions and substitution you will need to know the alphabet of all characters. You can define it as a big string var alphabet = "abcde....". Now you do a similar thing, but when you introduce substitution or insertion, you also iterate over all elements in your alphabet. I am not planning to write the whole code here.

A couple of additional things. You can make a lot of micro-optimizations here. For example if the length of two strings are different by more than 1, they clearly can't have a distance 1. Another one relates to the frequencies of underlying characters in the string.

edited May 23 '17 at 10:29

Community

1
1

answered Jun 19 '16 at 05:52

Salvador Dali

214,103
147
703
753

I ran your code through the fiddle and just what you have posted is twice as slow as my original post. I think slicing that often is too inefficient to work well as an optimization. I'll try changing it to a for loop. – Lawrence Douglas Jun 19 '16 at 06:15
@LawrenceDouglas What kind of strings have you tested it against. Strings of length 5, or strings of length 1000? Whenever you benchmark anything, you need to know what are you benchmarking. My solution scales linearly in terms of length of strings and has something like 3 * lenOfAlphabet constant factor. DP solution scales quadratically in terms of length but has a small constant factor. – Salvador Dali Jun 19 '16 at 06:21
95% of my words are going to be length 5-15 so testing against that. – Lawrence Douglas Jun 19 '16 at 06:24
@LawrenceDouglas then just implement the algorithm I pointed in a link and create microoptimizations I mentioned. As I mentioned O(n) algorithm will have a big constant factor so it does not make sense to use it for such a small strings. – Salvador Dali Jun 19 '16 at 06:27
What you linked is a general lev distance algorithm. I already have one posted in the original post that is much faster. – Lawrence Douglas Jun 19 '16 at 06:36

How to optimize levenshtein distance for checking for a distance of 1?

3 Answers3