For a client-side search tool I need to find the Levenshtein distance between a word and millions of other words. A user should be able to compare a short text of about twenty words with a book. The user could do this by finding the locations of the most characterizing words of the text in the book. 'Finding locations' does not mean looking for an exact match, but for a near match, as with Levenshtein. I started with already available implementations, but I needed more speed. I ended up with this:

var rowA = new Uint16Array(1e6);
var rowB = new Uint16Array(1e6);
function levenshtein(s1, s2) {
    var s1_len = s1.length, s2_len = s2.length, i1, i2 = 0, a, b, c, c2, i = 0;
    if (s1_len === 0)
        return s2_len;
    if (s2_len === 0)
        return s1_len;
    while (i < s1_len)
        rowA[i] = ++i;
    while (i2 < s2_len) {
        c2 = s2[i2];
        a = i2;
        ++i2;
        b = i2;
        for (i1 = 0; i1 < s1_len; ++i1) {
            c = a + (s1[i1] !== c2 );
            a = rowA[i1];
            b = b < a ? (b < c ? b + 1 : c) : (a < c ? a + 1 : c);
            rowB[i1] = b;
        }
        if (i2 === s2_len)
            return b;
        c2 = s2[i2];
        a = i2;
        ++i2;
        b = i2;
        for (i1 = 0; i1 < s1_len; ++i1) {
            c = a + (s1[i1] !== c2 );
            a = rowB[i1];
            b = b < a ? (b < c ? b + 1 : c) : (a < c ? a + 1 : c);
            rowA[i1] = b;
        }
    }
    return b;
}

As you can see, I used techniques like moving the arrays out of the function so they can be reused across calls. I also repeated myself a bit by linearizing the loop somewhat: the loop body is duplicated so that rowA and rowB alternate roles without swapping. Could it be faster? I am curious about your advice.
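When testing changes like these, a plain, readable two-row implementation is handy as a reference; the optimized version should always agree with it. This is only a correctness sketch (the name levenshteinRef is mine), not tuned for speed:

```javascript
// Plain two-row Levenshtein distance, for checking the optimized version.
function levenshteinRef(s1, s2) {
    var prev = [], curr = [], i, j, tmp;
    // Distance from the empty prefix of s1 to each prefix of s2.
    for (j = 0; j <= s2.length; ++j) prev[j] = j;
    for (i = 1; i <= s1.length; ++i) {
        curr[0] = i;
        for (j = 1; j <= s2.length; ++j) {
            var cost = s1[i - 1] === s2[j - 1] ? 0 : 1;
            // Minimum of insertion, deletion and substitution.
            curr[j] = Math.min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost);
        }
        tmp = prev; prev = curr; curr = tmp;
    }
    return prev[s2.length];
}
```

For example, levenshteinRef("kitten", "sitting") yields 3, and levenshteinRef("knowledge", "configured") yields 9 — plain Levenshtein, not the 8 that Damerau-Levenshtein would give.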

Update: after tips from Bergi and some more thinking, I came to this solution:

    var row = new Uint16Array(1e6);
    function levenshtein(s1, s2) {
        var s1_len = s1.length, s2_len = s2.length, i2 = 1, a, b = 0, c, c2, i1 = 0;
        if (s1_len === 0)
            return s2_len;
        if (s2_len === 0)
            return s1_len;
        c2 = s2[0];
        if (s1[0] === c2) {
            while (i1 < s1_len) {
                row[i1] = i1++;
            }
            b = s1_len - 1;
        } else {
            row[0] = 1;
            ++b;
            if (s1_len > 1)
                for (i1 = 1; i1 < s1_len; ++i1) {
                    if (s1[i1] === c2) {
                        row[i1] = b;
                        for (++i1; i1 < s1_len; ++i1) {
                            row[i1] = ++b;
                        }
                    } else {
                        row[i1] = ++b;
                    }
                }
        }
        if (s2_len > 1)
            while (i2 < s2_len) {
                c2 = s2[i2];
                c = i2 + (s1[0] !== c2);
                a = row[0];
                ++i2;
                b = i2 < a ? (i2 < c ? i2 + 1 : c) : (a < c ? a + 1 : c);
                row[0] = b;
                if (s1_len > 1) {
                    for (i1 = 1; i1 < s1_len; ++i1) {
                        c = a + (s1[i1] !== c2);
                        a = row[i1];
                        b = b < a ? (b < c ? b + 1 : c) : (a < c ? a + 1 : c);
                        row[i1] = b;
                    }
                }
            }
        return b;
    }

This is again much faster. I cannot squeeze more out of it, but I keep looking for other ideas and will try some more.

Marco de Wit
    Are you familiar with this thread: http://stackoverflow.com/questions/11919065/sort-an-array-by-the-levenshtein-distance-with-best-performance-in-javascript ? – Jan.J Aug 26 '13 at 10:17
  • Yes I am, but levDist('knowledge','configured') gives me 8 while I expected 9. So I am not sure about it. – Marco de Wit Aug 26 '13 at 10:47
  • @MarcodeWit: The comments on the accepted answer explain that the code there does Damerau-Levenshtein, which gives 8 for your words. – Bergi Aug 26 '13 at 10:56
  • @Bergi After deleting the Damerau transposition my algorithm is over six times as fast on Firefox. Mainly because of caching I suppose. – Marco de Wit Aug 26 '13 at 11:14
  • While I don't know how to make the algorithm itself faster, I suggest you try to parallelize the task using Web Workers. They sound like an ideal solution for your problem and will avoid freezing the UI. – Cesar Canassa Aug 26 '13 at 12:06

1 Answer

Since you are comparing against the same word over and over, you can get a little performance improvement by using partial application and caching there:

function levenshtein(s1) {
    var row0 = [], row1 = [], s1_len = s1.length;
    if (s1_len === 0)
        return function(s2) {
            return s2.length;
        };
    return function(s2) {
        var s2_len = s2.length, s1_idx, s2_idx = 0, a, b, c, c2, i = 0;
        if (s2_len === 0)
            return s1_len;
        …
        return b;
    };
}

"I also repeated myself a bit by linearizing the loop somewhat."

Not sure whether it gets a lot faster, but you can omit one of the arrays - you do not need to read/write them in an alternating manner:

function levenshtein(s1) {
    var s1_len = s1.length, row = new Array(s1_len);
    if (s1_len === 0)
        return function(s2) {
            return s2.length;
        };
    return function(s2) {
        var s2_len = s2.length, s1_idx, s2_idx = 0, a, b, c, c2, i = 0;
        if (s2_len === 0)
            return s1_len;
        while (i < s1_len)
            row[i] = ++i;
        while (s2_idx < s2_len) {
            c2 = s2[s2_idx];
            a = s2_idx;
            ++s2_idx;
            b = s2_idx;
            for (s1_idx = 0; s1_idx < s1_len; ++s1_idx) {
                c = a + (s1[s1_idx] === c2 ? 0 : 1);
                a = row[s1_idx];
                b = b < a ? (b < c ? b + 1 : c) : (a < c ? a + 1 : c);
                row[s1_idx] = b;
            }
        }
        return b;
    };
}

I don't think further optimisations can be made without putting your millions of words in a dedicated data structure (e.g. a prefix trie).
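A rough idea of what such a trie-based search could look like — the names and structure below are my own sketch, following the common approach of keeping one Levenshtein row per trie node, so that shared prefixes are computed only once and whole subtrees can be skipped:

```javascript
function TrieNode() {
    this.children = {}; // letter -> TrieNode
    this.word = null;   // set on the node that ends a word
}

function insert(root, word) {
    var node = root;
    for (var i = 0; i < word.length; ++i) {
        var ch = word[i];
        if (!node.children[ch]) node.children[ch] = new TrieNode();
        node = node.children[ch];
    }
    node.word = word;
}

// Collect all stored words within maxDist of `query`.
function fuzzySearch(root, query, maxDist) {
    var results = [];
    // Row for the empty prefix: distance 0..query.length.
    var firstRow = [];
    for (var i = 0; i <= query.length; ++i) firstRow.push(i);
    for (var ch in root.children)
        walk(root.children[ch], ch, firstRow);
    return results;

    function walk(node, ch, prevRow) {
        var cols = query.length + 1;
        var row = [prevRow[0] + 1];
        for (var j = 1; j < cols; ++j) {
            var ins = row[j - 1] + 1;
            var del = prevRow[j] + 1;
            var sub = prevRow[j - 1] + (query[j - 1] === ch ? 0 : 1);
            row.push(Math.min(ins, del, sub));
        }
        if (node.word !== null && row[cols - 1] <= maxDist)
            results.push({ word: node.word, dist: row[cols - 1] });
        // Prune: if every cell already exceeds maxDist,
        // no descendant of this node can come back under it.
        if (Math.min.apply(null, row) <= maxDist)
            for (var next in node.children)
                walk(node.children[next], next, row);
    }
}
```

The pruning step is where the big win over comparing each word separately comes from: one row that is entirely above the threshold cuts off an entire subtree of words.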

Bergi
  • Omitting one of the arrays was quite obvious. Strange that I did not see it myself. – Marco de Wit Aug 26 '13 at 13:10
  • At first I had expected to need some additional code for accessing the previous-row overwritten value as well, before I noticed that it's already cached in `a` :-) If you need further optimisations, please tell us about the format of the million words, what exactly you are searching (sorting?) in them and what `s1` values you are expecting – Bergi Aug 26 '13 at 14:05
  • 1
    @MarcodeWit "putting your millions of words in a dedicated data structure (e.g. a prefix trie)" This is a huge win. – David Eisenstat Aug 26 '13 at 15:03
  • @bergi A user should be able to compare a short text of about twenty words with a book. The user could do this by finding locations of the most characterizing words of the text in the book. 'Finding locations' does not mean looking for an exact match, but for a near match, as with Levenshtein. – Marco de Wit Aug 26 '13 at 15:10
  • @MarcodeWit: Oops, "finding words" is a little different task. Have you read http://en.wikipedia.org/wiki/Approximate_string_matching already? – Bergi Aug 26 '13 at 15:56
  • Yes, the idea is that the user can click some of the words in his little text and then the area in the book with the most fuzzy hits will be shown. Other areas will be visible via a heat map of the book. – Marco de Wit Aug 26 '13 at 16:02
  • So are you dividing your book into many non-overlapping, discrete areas (of any size) that shall be searched independently, or do you want a letter-exact continuous heat map? – Bergi Aug 26 '13 at 16:11
  • The heat map will be based on density, for instance the number of hits (Levenshtein distance smaller than 10) within ten words of each position. But let's keep to the subject. I need this algorithm regularly and now want a good implementation. – Marco de Wit Aug 26 '13 at 19:57