55

So I have a random javascript array of names...

[@larry,@nicholas,@notch] etc.

They all start with the @ symbol. I'd like to sort them by the Levenshtein Distance so that the the ones at the top of the list are closest to the search term. At the moment, I have some javascript that uses jQuery's .grep() on it using javascript .match() method around the entered search term on key press:

(code edited since first publish)

limitArr = $.grep(imTheCallback, function(n){
    return n.match(searchy.toLowerCase())
});
modArr = limitArr.sort(levenshtein(searchy.toLowerCase(), 50))
if (modArr[0].substr(0, 1) == '@') {
    if (atRes.childred('div').length < 6) {
        modArr.forEach(function(i){
            atRes.append('<div class="oneResult">' + i + '</div>');
        });
    }
} else if (modArr[0].substr(0, 1) == '#') {
    if (tagRes.children('div').length < 6) {
        modArr.forEach(function(i){
            tagRes.append('<div class="oneResult">' + i + '</div>');
        });
    }
}

$('.oneResult:first-child').addClass('active');

$('.oneResult').click(function(){
    window.location.href = 'http://hashtag.ly/' + $(this).html();
});

It also has some if statements detecting if the array contains hashtags (#) or mentions (@). Ignore that. The imTheCallback is the array of names, either hashtags or mentions, then modArr is the array sorted. Then the .atResults and .tagResults elements are the elements that it appends each time in the array to, this forms a list of names based on the entered search terms.

I also have the Levenshtein Distance algorithm:

var levenshtein = function(min, split) {
    // Levenshtein Algorithm Revisited - WebReflection
    try {
        split = !("0")[0]
    } catch(i) {
        split = true
    };

    return function(a, b) {
        if (a == b)
            return 0;
        if (!a.length || !b.length)
            return b.length || a.length;
        if (split) {
            a = a.split("");
            b = b.split("")
        };
        var len1 = a.length + 1,
            len2 = b.length + 1,
            I = 0,
            i = 0,
            d = [[0]],
            c, j, J;
        while (++i < len2)
            d[0][i] = i;
        i = 0;
        while (++i < len1) {
            J = j = 0;
            c = a[I];
            d[i] = [i];
            while(++j < len2) {
                d[i][j] = min(d[I][j] + 1, d[i][J] + 1, d[I][J] + (c != b[J]));
                ++J;
            };
            ++I;
        };
        return d[len1 - 1][len2 - 1];
    }
}(Math.min, false);

How can I work with algorithm (or a similar one) into my current code to sort it without bad performance?

UPDATE:

So I'm now using James Westgate's Lev Dist function. Works WAYYYY fast. So performance is solved, the issue now is using it with source...

modArr = limitArr.sort(function(a, b){
    levDist(a, searchy)
    levDist(b, searchy)
});

My problem now is general understanding on using the .sort() method. Help is appreciated, thanks.

Thanks!

icedwater
  • 4,701
  • 3
  • 35
  • 50
alt
  • 13,357
  • 19
  • 80
  • 120
  • 2
    If it's an array, why are you iterating with `for..in`? That will also iterate over the `length` property (or any other non-index property of the array, if you defined such), and inherited enumerable properties (which might exist if some of your other code tries to polyfill the ES5 array methods). You can iterate arrays with the native `.forEach`, or jQuery's `$.each`. – Šime Vidas Aug 12 '12 at 01:44
  • Shouldn't it be `if (modArr[i]` instead of `if (modArr[0]`? – Šime Vidas Aug 12 '12 at 01:48
  • How many `.atResults`, and `.tagResults` elements are there on the page? One of each? Is their number static, or are you adding new ones dynamically? If their number is static, you should cache their references, so that you don't have to query them on each key up. – Šime Vidas Aug 12 '12 at 01:51
  • To your first comment. I'm not using the forEach method, much nicer! For the second question, it didn't really matter, that if statement should've been outside the for...in statement anyway. But I've since moved to the .forEach method, it must be outside and so I'll just grab the first one. And third question, there's only one of each element on the page, and now they're cached. Thanks. – alt Aug 14 '12 at 05:31
  • 1
    modArr = limitArr.sort(function(a,b){ return levDist(b,searchy) - levDist(a,searchy); }); – James Westgate Aug 21 '12 at 11:41
  • 1
    @JamesWestgate to the rescue again. (Note: that will sort in ascending order of similarity, which might not be the order you want; to sort in _descending_ order of likeness, use `function(a,b){ return levDist(a, searchy) - levDist(b,searchy); }`.) – Jordan Gray Aug 22 '12 at 00:10
  • 2
    Yeah, and you may want to cache the result depending on the size of the array as it may do the same calculation multiple times. Just a simple associative array would do the trick. – James Westgate Aug 22 '12 at 09:17
  • I have also found soundex algorithms to be useful in trying to spell check / match words. – James Westgate Aug 23 '12 at 10:29
  • So what you could do is put the names to be searched in buckets by their length. THen you can reduce the number of terms to check against because you only need to check the buckets within the same length as the search term +- limit. In that way you can remove the limit check in my jsperf version of the algorithm. – James Westgate Aug 24 '12 at 14:18
  • It compiles pretty well with closure compiler, a 50% size compression with advanced optimization with one modification: `this["levenshtein"] = function(min, split) {` on the first line to avoiding it compiling down to nothing. –  Mar 18 '17 at 14:37

7 Answers7

115

I wrote an inline spell checker a few years ago and implemented a Levenshtein algorithm - since it was inline and for IE8 I did quite a lot of performance optimisation.

var levDist = function(s, t) {
    var d = []; //2d matrix

    // Step 1
    var n = s.length;
    var m = t.length;

    if (n == 0) return m;
    if (m == 0) return n;

    //Create an array of arrays in javascript (a descending loop is quicker)
    for (var i = n; i >= 0; i--) d[i] = [];

    // Step 2
    for (var i = n; i >= 0; i--) d[i][0] = i;
    for (var j = m; j >= 0; j--) d[0][j] = j;

    // Step 3
    for (var i = 1; i <= n; i++) {
        var s_i = s.charAt(i - 1);

        // Step 4
        for (var j = 1; j <= m; j++) {

            //Check the jagged ld total so far
            if (i == j && d[i][j] > 4) return n;

            var t_j = t.charAt(j - 1);
            var cost = (s_i == t_j) ? 0 : 1; // Step 5

            //Calculate the minimum
            var mi = d[i - 1][j] + 1;
            var b = d[i][j - 1] + 1;
            var c = d[i - 1][j - 1] + cost;

            if (b < mi) mi = b;
            if (c < mi) mi = c;

            d[i][j] = mi; // Step 6

            //Damerau transposition
            if (i > 1 && j > 1 && s_i == t.charAt(j - 2) && s.charAt(i - 2) == t_j) {
                d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost);
            }
        }
    }

    // Step 7
    return d[n][m];
}
Michal
  • 3,584
  • 9
  • 47
  • 74
James Westgate
  • 11,306
  • 8
  • 61
  • 68
  • 2
    I haven't gone through the code, but I've done a quick test and can verify that this is significantly faster than the OP's algorithm. Nice work! – Jordan Gray Aug 15 '12 at 13:57
  • I can see a few more tweaks for performance such as === over == – James Westgate Aug 15 '12 at 15:46
  • 2
    One little thing: I notice that this is the Damerau–Levenshtein distance you calculate rather than just the Levenshtein distance. If you cut out the test for transpositions, this will be closer to what the OP wants and will run even faster. :) – Jordan Gray Aug 15 '12 at 15:48
  • 5
    Fun fact, the `var` keyword serves an actual purpose and isn't a reset button. – AlienWebguy Aug 18 '12 at 05:40
  • 3
    Thank you! This is honestly the only good performance function I've seen on the web. All the others get really slow on long string, this is fantastic. Well done. *claps hands* – alt Aug 18 '12 at 16:54
  • So now I have the distance, how would I best use this function with javascript's .sort() method? Thanks in advance. – alt Aug 18 '12 at 16:59
  • Your sort will take two arguments which are two elements to compare in the list You want to return the two lev values for the two elements vs the search term. You probably want to write a self caching function for that. Hope that makes sense. – James Westgate Aug 18 '12 at 18:07
  • Any code I can look at? Examples of using sort with something like this? Thanks again. – alt Aug 18 '12 at 23:51
  • And I posted an example of something I tried in the original question. Thx. – alt Aug 19 '12 at 00:02
  • 1
    @JamesWestgate I threw up a JSPerf page to compare a few implementations, and yours performs brilliantly. (You could make it even faster by implementing a limit and getting rid of the Damerau transposition.) – Jordan Gray Aug 20 '12 at 10:30
  • @JordanGray could you take a look at my code in the original question and see if I'm implementing the .sort() correctly? Thanks. – alt Aug 21 '12 at 09:37
  • Bah, I forgot to post the link to the JSPerf page: http://jsperf.com/levenshtein-distance – Jordan Gray Aug 23 '12 at 09:39
  • 2
    If you set a limit, and you take out DT, then the performance flies. http://jsperf.com/levenshtein-distance/2 Note: my original implementation had buckets of items sorted alphabetically, then each letter of the alphabet also had a bucket by length, in that way I could compare items only of the same length or lengths up to a limit without the check inside the algorithm. – James Westgate Aug 23 '12 at 10:23
  • Running tests on a few different browsers gives very interesting results. Your algorithm frequently comes out on top—especially considering the one that runs without a limit—but Firefox seems to LOOOVE Stoianovici's. Anomalously so; I expect this is due to some oddity of the JS engine rather than an intrinsic algorithmic factor. – Jordan Gray Aug 24 '12 at 14:01
  • This answer continues to get up voted which is great. Looking at it now, there are definitely ways of improving it further, specifically around memory management / recreation of the matrix variable d, and possibly those var declarations inside the loop. Javascript engines have of course moved on since I wrote this for IE8, which was the latest release of IE at the time, so it may not be as important to get this right as it once was. – James Westgate Dec 10 '13 at 11:09
  • Also potentially worth looking at this [implementation by Andrew Hedges](http://andrew.hedges.name/experiments/levenshtein/), code [here](http://andrew.hedges.name/experiments/levenshtein/levenshtein.js). I added it to [revision 17 in the jsPerf](http://jsperf.com/levenshtein-distance/17). On Chrome 31 it practically tied with Westgate and Stoianovici, would be interesting to see how it performs on other browsers – tobek Jan 04 '14 at 20:23
  • Someone may also want to digest this algorithm and see if it makes a difference -> http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/ – James Westgate Jan 13 '14 at 14:05
  • Thanks for the great answer, a little bug in the code though - in the damerau transposition we should have `i > 2 && j > 2` rather than `i > 1 && j > 1` - otherwise it may throw an exception. – user3141326 Feb 04 '15 at 12:38
  • 1
    Here's a different algorithm that may provide a performance increase - I havent though - tried it http://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html – James Westgate Oct 28 '15 at 14:51
  • Unless I've implemented it wrong, there may be a bug with the Damerau transposition: This code seems to produce the result of 5 for ("84631","7345"), when the correct result is 4 (substitute 8 with 7, transpose 3 & 4, delete 6, substitute 1 with 5). – Isaac Sep 14 '17 at 09:58
  • @Isaac look at the comment above which may help. – James Westgate Sep 15 '17 at 09:24
  • > `if (i == j && d[i][j] > 4) return n;` What is this line supposed to do? It cheks if any diagonal (except `d[0][0]`) is greater than 4, but no such value exists before the loop (only the first row and col has any), and when d is mutated inside the loop, it is only the *current* cell `d[i][j]`, and only *below* this line. I see no case where the cell already has a value at that point? – felix Jul 13 '21 at 01:24
13

I came to this solution:

var levenshtein = (function() {
        var row2 = [];
        return function(s1, s2) {
            if (s1 === s2) {
                return 0;
            } else {
                var s1_len = s1.length, s2_len = s2.length;
                if (s1_len && s2_len) {
                    var i1 = 0, i2 = 0, a, b, c, c2, row = row2;
                    while (i1 < s1_len)
                        row[i1] = ++i1;
                    while (i2 < s2_len) {
                        c2 = s2.charCodeAt(i2);
                        a = i2;
                        ++i2;
                        b = i2;
                        for (i1 = 0; i1 < s1_len; ++i1) {
                            c = a + (s1.charCodeAt(i1) === c2 ? 0 : 1);
                            a = row[i1];
                            b = b < a ? (b < c ? b + 1 : c) : (a < c ? a + 1 : c);
                            row[i1] = b;
                        }
                    }
                    return b;
                } else {
                    return s1_len + s2_len;
                }
            }
        };
})();

See also http://jsperf.com/levenshtein-distance/12

Most speed was gained by eliminating some array usages.

Marco de Wit
  • 2,686
  • 18
  • 22
6

Updated: http://jsperf.com/levenshtein-distance/5

The new Revision annihilates all other benchmarks. I was specifically chasing Chromium/Firefox performance as I don't have an IE8/9/10 test environment, but the optimisations made should apply in general to most browsers.

Levenshtein Distance

The matrix to perform Levenshtein Distance can be reused again and again. This was an obvious target for optimisation (but be careful, this now imposes a limit on string length (unless you were to resize the matrix dynamically)).

The only option for optimisation not pursued in jsPerf Revision 5 is memoisation. Depending on your use of Levenshtein Distance, this could help drastically but was omitted due to its implementation specific nature.

// Cache the matrix. Note this implementation is limited to
// strings of 64 char or less. This could be altered to update
// dynamically, or a larger value could be used.
var matrix = [];
for (var i = 0; i < 64; i++) {
    matrix[i] = [i];
    matrix[i].length = 64;
}
for (var i = 0; i < 64; i++) {
    matrix[0][i] = i;
}

// Functional implementation of Levenshtein Distance.
String.levenshteinDistance = function(__this, that, limit) {
    var thisLength = __this.length, thatLength = that.length;

    if (Math.abs(thisLength - thatLength) > (limit || 32)) return limit || 32;
    if (thisLength === 0) return thatLength;
    if (thatLength === 0) return thisLength;

    // Calculate matrix.
    var this_i, that_j, cost, min, t;
    for (i = 1; i <= thisLength; ++i) {
        this_i = __this[i-1];

        for (j = 1; j <= thatLength; ++j) {
            // Check the jagged ld total so far
            if (i === j && matrix[i][j] > 4) return thisLength;

            that_j = that[j-1];
            cost = (this_i === that_j) ? 0 : 1;  // Chars already match, no ++op to count.
            // Calculate the minimum (much faster than Math.min(...)).
            min    = matrix[i - 1][j    ] + 1;                      // Deletion.
            if ((t = matrix[i    ][j - 1] + 1   ) < min) min = t;   // Insertion.
            if ((t = matrix[i - 1][j - 1] + cost) < min) min = t;   // Substitution.

            matrix[i][j] = min; // Update matrix.
        }
    }

    return matrix[thisLength][thatLength];
};

Damerau-Levenshtein Distance

jsperf.com/damerau-levenshtein-distance

Damerau-Levenshtein Distance is a small modification to Levenshtein Distance to include transpositions. There is very little to optimise.

// Damerau transposition.
if (i > 1 && j > 1 && this_i === that[j-2] && this[i-2] === that_j
&& (t = matrix[i-2][j-2]+cost) < matrix[i][j]) matrix[i][j] = t;

Sorting Algorithm

The second part of this answer is to choose an appropriate sort function. I will upload optimised sort functions to http://jsperf.com/sort soon.

  • Welcome to Stack Overflow! Thanks for your post! Please do not use signatures/taglines in your posts. Your user box counts as your signature, and you can use your profile to post any information about yourself you like. [FAQ on signatures/taglines](http://stackoverflow.com/faq#signatures) – Andrew Barber Feb 21 '13 at 21:11
  • Hi! Sorry, my mistake... I assumed stackoverflow was going to post my message under "anonymous" or "guest"... I see it has created a pseudo account... guess I'll register fully. – TheSpanishInquisition Feb 21 '13 at 21:40
  • 1
    Hi, the prototype methods you include are bound to be slower so I cannot see their relevance in these tests. – StuR Jul 30 '13 at 13:41
4

I implemented a very performant implementation of levenshtein distance calculation if you still need this.

function levenshtein(s, t) {
    if (s === t) {
        return 0;
    }
    var n = s.length, m = t.length;
    if (n === 0 || m === 0) {
        return n + m;
    }
    var x = 0, y, a, b, c, d, g, h, k;
    var p = new Array(n);
    for (y = 0; y < n;) {
        p[y] = ++y;
    }

    for (; (x + 3) < m; x += 4) {
        var e1 = t.charCodeAt(x);
        var e2 = t.charCodeAt(x + 1);
        var e3 = t.charCodeAt(x + 2);
        var e4 = t.charCodeAt(x + 3);
        c = x;
        b = x + 1;
        d = x + 2;
        g = x + 3;
        h = x + 4;
        for (y = 0; y < n; y++) {
            k = s.charCodeAt(y);
            a = p[y];
            if (a < c || b < c) {
                c = (a > b ? b + 1 : a + 1);
            }
            else {
                if (e1 !== k) {
                    c++;
                }
            }

            if (c < b || d < b) {
                b = (c > d ? d + 1 : c + 1);
            }
            else {
                if (e2 !== k) {
                    b++;
                }
            }

            if (b < d || g < d) {
                d = (b > g ? g + 1 : b + 1);
            }
            else {
                if (e3 !== k) {
                    d++;
                }
            }

            if (d < g || h < g) {
                g = (d > h ? h + 1 : d + 1);
            }
            else {
                if (e4 !== k) {
                    g++;
                }
            }
            p[y] = h = g;
            g = d;
            d = b;
            b = c;
            c = a;
        }
    }

    for (; x < m;) {
        var e = t.charCodeAt(x);
        c = x;
        d = ++x;
        for (y = 0; y < n; y++) {
            a = p[y];
            if (a < c || d < c) {
                d = (a > d ? d + 1 : a + 1);
            }
            else {
                if (e !== s.charCodeAt(y)) {
                    d = c + 1;
                }
                else {
                    d = c;
                }
            }
            p[y] = d;
            c = a;
        }
        h = d;
    }

    return h;
}

It was my answer to a similar SO question Fastest general purpose Levenshtein Javascript implementation

Update

A improved version of the above is now on github/npm see https://github.com/gustf/js-levenshtein

gustf
  • 1,959
  • 13
  • 20
  • Please check out [Take a tour](http://stackoverflow.com/tour) and [Your answer is in another castle: When is an answer not an answer](http://meta.stackexchange.com/questions/225370) – Drew Feb 28 '16 at 14:37
  • @Drew Thanks, it was a bit unclear that the link was to another similar SO question. I changed it a bit. All good? – gustf Feb 28 '16 at 14:48
  • 1
    One way to think of it is, if your website link is not there, is that an answer. For 27, it is over 400 lines of code. So not sure what much to say. Not too easy to plop that in here. – Drew Feb 28 '16 at 14:54
  • 1
    Ok, I get your point. Added the implementation also now and also made a few other improvements. Thanks for taking the time to review my answer, I am still learning the SO way. – gustf Feb 28 '16 at 15:06
2

The obvious way of doing this is to map each string to a (distance, string) pair, then sort this list, then drop the distances again. This way you ensure the levenstein distance only has to be computed once. Maybe merge duplicates first, too.

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
2

I would definitely suggest using a better Levenshtein method like the one in @James Westgate's answer.

That said, DOM manipulations are often a great expense. You can certainly improve your jQuery usage.

Your loops are rather small in the example above, but concatenating the generated html for each oneResult into a single string and doing one append at the end of the loop will be much more efficient.

Your selectors are slow. $('.oneResult') will search all elements in the DOM and test their className in older IE browsers. You may want to consider something like atRes.find('.oneResult') to scope the search.

In the case of adding the click handlers, we may want to do one better avoid setting handlers on every keyup. You could leverage event delegation by setting a single handler on atRest for all results in the same block you are setting the keyup handler:

atRest.on('click', '.oneResult', function(){
  window.location.href = 'http://hashtag.ly/' + $(this).html();
});

See http://api.jquery.com/on/ for more info.

Jacob Swartwood
  • 4,075
  • 1
  • 16
  • 17
  • I believe that the question asked about Levenstein, however your answer was about Jquery. Please consider revising your response into a comment. – Jack G Mar 03 '19 at 18:38
1

I just wrote an new revision: http://jsperf.com/levenshtein-algorithms/16

function levenshtein(a, b) {
  if (a === b) return 0;

  var aLen = a.length;
  var bLen = b.length;

  if (0 === aLen) return bLen;
  if (0 === bLen) return aLen;

  var len = aLen + 1;
  var v0 = new Array(len);
  var v1 = new Array(len);

  var i = 0;
  var j = 0;
  var c2, min, tmp;

  while (i < len) v0[i] = i++;

  while (j < bLen) {
    c2 = b.charAt(j++);
    v1[0] = j;
    i = 0;

    while (i < aLen) {
      min = v0[i] - (a.charAt(i) === c2 ? 1 : 0);
      if (v1[i] < min) min = v1[i];
      if (v0[++i] < min) min = v0[i];
      v1[i] = min + 1;
    }

    tmp = v0;
    v0 = v1;
    v1 = tmp;
  }
  return v0[aLen];
}

This revision is faster than the other ones. Works even on IE =)

gtournie
  • 4,143
  • 1
  • 21
  • 22