2

I've got a large array of words in Javascript (~100,000), and I'd like to be able to quickly return a subset of them based on a text pattern.

For example, I'd like to return all the words that begin with a pattern so typing hap should give me ["happy", "happiness", "happening", etc, etc], as a result.

If it's possible I'd like to do this without iterating over the entire array.

Something like this is not working fast enough:

// data contains an array of beginnings of words e.g. 'hap'
$.each(data, function(key, possibleWord) {
 found = $.inArray(possibleWord, words);
 // do something if found
}

Any ideas on how I could quickly reduce the set to possible matches without iterating over the whole word set? The word array is in alphabetical order if that helps.

6 Answers6

3

If you just want to search for prefixes there are data structures just for that, such as the Trie and Ternary search trees

A quick Google search and some promissing Javascrit Trie and autocomplete implementations show up:

http://ejohn.org/blog/javascript-trie-performance-analysis/

Autocomplete using a trie

http://odhyan.com/blog/2010/11/trie-implementation-in-javascript/

Community
  • 1
  • 1
hugomg
  • 68,213
  • 24
  • 160
  • 246
1

I have absolutely no idea if this is any faster (a jsperf test is probably in order...), but you can do it with one giant string and a RegExp search instead of arrays:

var giantStringOfWords = giantArrayOfWords.join(' ');
function searchForBeginning(beginning, str) {
    var pattern = new RegExp('\\b' + str + '\\w*'),
        matches = str.match(pattern);
    return matches;
}

var hapResults = searchForBeginning('hap', giantStringOfWords);
sdleihssirhc
  • 42,000
  • 6
  • 53
  • 67
0

The best approach is to structure the data better. Make an object with keys like "hap". That member holds an array of words (or word suffixes if you want to save space) or a separated string of words for regexp searching.

This means you will have shorter objects to iterate/search. Another way is to sort the arrays and use a binary search pattern. There's a good conversation about techniques and optimizations here: http://ejohn.org/blog/revised-javascript-dictionary-search/

AutoSponge
  • 1,444
  • 10
  • 7
0

I suppose that using raw javascript can help a bit, you can do:

var arr = ["happy", "happiness", "nothere", "notHereEither", "happening"], subset = [];
for(var i = 0, len = arr.length; i < len; i ++) {
     if(arr[i].search("hap") !== -1) {
           subset.push(arr[i]);
     }
}
//subset === ["happy", "happiness","happening"]

Also, if the array is ordered you could break early if the first letter is bigger than the first of your search, instead of looping the entire array.

nicosantangelo
  • 13,216
  • 3
  • 33
  • 47
  • 2
    You might want to do something like `arr[i].toLowerCase().search(query.toLowerCase()) !== -1` when searching, for case-insensitive results. – Naftuli Kay Nov 18 '11 at 23:53
0
var data = ['foo', 'happy', 'happiness', 'foohap'];    
jQuery.each(data, function(i, item) {
      if(item.match(/^hap/))
        console.log(item) 
    });

If you have the data in an array, you're going to have to loop through the whole thing.

Jordan Brown
  • 13,603
  • 6
  • 30
  • 29
0

A really simple optimization is on page load go through your big words array and make a note of what index ranges apply to each starting letter. E.g., in my example below the "a" words go from 0 to 2, "b" words go from 3 to 4, etc. Then when actually doing a pattern match only look through the applicable range. Although obviously some letters will have more words than others, a given search will only have to look through an average of 100,000/26 words.

// words array assumed to be lowercase and in alphabetical order
var words = ["a","an","and","be","blue","cast","etc."];

// figure out the index for the first and last word starting with
// each letter of the alphabet, so that later searches can use
// just the appropriate range instead of searching the whole array
var letterIndexes = {},
    i,
    l,
    letterIndex = 0,
    firstLetter;
for (i=0, l=words.length; i<l; i++) {
    if (words[i].charAt(0) === firstLetter)
       continue;
    if (firstLetter)
        letterIndexes[firstLetter] = {first : letterIndex, last : i-1};
    letterIndex = i;
    firstLetter = words[i].charAt(0);
}

function getSubset(pattern) {
    pattern = pattern.toLowerCase()
    var subset = [],
        fl = pattern.charAt(0),
        matched = false;
    if (letterIndexes[firstLetter])
        for (var i = letterIndexes[fl].first, l = letterIndex[fl].last; i <= l; i++) {
            if (pattern === words[i].substr(0, pattern.length)) {
                subset.push(words[i]);
                matched = true;
            } else if (matched) {
                break;
            }
        }
    return subset;        
}

Note also that when searching through the (range within the) words array, once a match is found I set a flag, which indicates we've gone past all of the words that are alphabetically before the pattern and are now making our way through the matching words. That way as soon as the pattern no longer matches we can break out of the loop. If the pattern doesn't match at all we still end up going through all the words for that first letter though.

Also, if you're doing this as a user types, when letters are added to the end of the pattern you only have to search through the previous subset, not through the whole list.

P.S. Of course if you want to break the word list up by first letter you could easily do that server-side.

nnnnnn
  • 147,572
  • 30
  • 200
  • 241