1

Problem

I'm trying to implement some sort of "fuzzy search" in my Node.js based project.

Fuzzy search is a search that returns results even if the string didn't match exactly.


I found this code in another stackoverflow thread. Code is below.

It's quite good, but the problem is - it's synchronous It slows down the whole program when it searches through a large array.

Question


ES6 methods are welcomed. Needs to work only in the latest Chrome, so any JS methods will work.


  • Are there any new JS methods that would optimize this function?

  • Am I doing something wrong there that makes it even slower?

  • Can I turn this function into an async function that returns a promise? Will it stop freezing the app during the search then?

  • Are there any better "fuzzy search" implementations you know of? (I found a module called fuzzysort, can't say if it's that much better though, it won't return "folder test" if you type "test folder" (wrong order) so it's not that good)

Code

Calling search function

searchArray is an array of paths it searches through, e.g.: ["C:\\test", "C:\\file.txt"...] (0.5 - 5 million paths)

searchQuery is a string without spaces, e.g.: filetxt

search () {
  const fuzzySearch = this.fuzzySearch(this.searchQuery.toLowerCase(), this.searchArray) 
  let result = fuzzySearch.filter(element => element.relevance >= 0.3)
  // sort by relevance
  var sortedResults = result.sort((a, b) => parseFloat(b.relevance) - parseFloat(a.relevance)).map(item => item.name);
      
  this.searchResults = sortedResults
},

The fuzzy search function

fuzzySearch (searchQuery, searchArray) {
  const get_bigrams = function(string) {
  const s = string.toLowerCase();
  const v = new Array(s.length - 1);
  for (let i = 0, end = v.length; i <= end; i++) {
      v[i] = s.slice(i, i + 2);
  }
  return v;
  };

  const string_similarity = function(str1, str2) {
      if ((str1.length > 0) && (str2.length > 0)) {
          const pairs1 = get_bigrams(str1);
          const pairs2 = get_bigrams(str2);
          const union = pairs1.length + pairs2.length;
          let hit_count = 0;
          for (let x of Array.from(pairs1)) {
              for (let y of Array.from(pairs2)) {
                  if (x === y) {
                      hit_count++;
                  }
              }
          }
          if (hit_count > 0) {
              return ((2.0 * hit_count) / union);
          }
      }
      return 0.0;
  };

  let results = [];
  for (let name of searchArray) {
    // I added .match to use only the base filename (name+ext) not the whole path, and removed all characters
    let filteredPath = name.match(/[^\\\/]+$/)[0].replace(/[^A-Za-z0-9.]+/g, '')
              
    const relevance = string_similarity(searchQuery, filteredPath);
    const obj = {name, relevance};
    results.push(obj);
  }
  return results
},
Community
  • 1
  • 1
Un1
  • 3,874
  • 12
  • 37
  • 79
  • what is joinedCache ? does it have the entire array? – karthick Jun 01 '18 at 22:16
  • @karthick sorry for the confusion, it's just an array of paths that it searches through (it's very big): `["C:\\test", "C:\\file.txt"...]` – Un1 Jun 01 '18 at 22:18
  • _"Node.js based project" "Needs to work only in the latest Chrome"_ Is `fuzzySearch` running in Node or Chrome? – AuxTaco Jun 01 '18 at 23:27
  • @AuxTaco well, both, it's an electron project. I just mentioned Node js because it also provides lots of useful modules that can be used for implementing a search. – Un1 Jun 02 '18 at 09:43
  • Are those paths the actual strings you’re filtering or are you reading and filtering their content ? Also how often the array of paths update ? – user3210641 Jun 02 '18 at 15:53
  • @user3210641 It's reading just the paths, its only task is to return the paths that "kind of" or "exactly" match the query (hence "fuzzy search"), so if you give it query `"test"` it should only return `"C:\test"` from this array `["C:\test", C:\bob]`. It always gets a ready to use array of paths (0.5 - 5 million), so it only needs to find the ones that match – Un1 Jun 02 '18 at 16:26
  • A rather obvious optimisation is not to call `get_bigrams` on the `searchQuery` for every comparison again. A not so obvious but very important one is not to use nested loops to count the similiarities. Use an efficient lookup structure, e.g. a `Set`, instead. Further optimisations would depend on the `searchArray` - is it the same array used in multiple searches, are there many items that share large parts? – Bergi Jun 02 '18 at 21:43
  • "*Will turning this function into an async function stop freezing the app during the search?*" - no. For that, you'd need to actually defer work, e.g. make pauses after every 1000th iteration. (`async`/`await` can simplify implementing this, though). – Bergi Jun 02 '18 at 21:48
  • @Bergi thank you for the tips, mate. Yes it's the same array, though it slices off everything but the base filename before trying to match anything, and if the base filename matches the query it returns its full path. – Un1 Jun 02 '18 at 21:56

0 Answers0