2

Essentially, I have a Set of words, about 250,000 of them and want to be able to return a list of which ones are found in a given string.

eg. input string is 'APPLEASEDITION', I want to return

[APP,APPLE,PLEA, PLEAS,PLEASE,PLEASED,lEA,LEAS,LEASE,LEASED,EA,EAS,EASE,EASED,AS,SEDITION,EDITION,IT,TI,ON]

I came up with this code, which works faster than the method mentioned above for shorter input strings (up to 15 characters), but doubles in execution time with each added letter:

const findWords = (instring, solutions = null) => {
  if (!solutions) solutions = new Set();
  if (!instring) {
    return new Set();
  }
  if (words[instring]) {
    solutions.add(instring);
  }
  const suffix = instring.slice(1);
  const prefix = instring.slice(0, instring.length - 1);

  if (!solutions.has(prefix))
    solutions = new Set([...solutions, ...findWords(prefix, solutions)]);
  if (!solutions.has(suffix))
    solutions = new Set([...solutions, ...findWords(suffix, solutions)]);
  return solutions;
};

Wondering if anyone can help me out optimizing the code?

Edit:

I made a different solution, it works much better

const getAllSubstrings = (str) => {
  let result = [];

  for (let i = 0; i < str.length; i++) {
      for (let j = i + 1; j < str.length + 1; j++) {
          result.push(str.slice(i, j));
      }
  }
  return result;
}

const findWords = (instring) => {
  const solutions = []
  let subs = getAllSubstrings(instring)
  for (let sub of subs) {
    if (words[sub])
      solutions.push(sub)
  }
  return solutions
}

Still open to possible improvements, but this works well enough for my use case

danikitty
  • 31
  • 3
  • 1
    You probably need to rely on some kind of heuristics or index to generate the list: what you're doing is simply brute-forcing your way through all combinations, which is inefficient and O(n) complex. – Terry Jan 24 '22 at 23:19
  • Have you considered doing this server-side? Or using an existing library (probably one that is written in WASM)? JS is single-threaded, so you'll want something that does this efficiently (like WASM) off the main thread – Enijar Jan 24 '22 at 23:20
  • Have you looked at using a [trie](https://en.wikipedia.org/wiki/Trie)? That definitely seems the right approach for the problem, especially if you want to reuse the dictionary for multiple test cases. – Scott Sauyet Jan 25 '22 at 04:28

1 Answers1

2
  1. As it stands your logic assumes your input starts or ends with the phrase, but doesn't consider words in the middle - you'll need to generate permutations

  2. Convert your dictionary to a hash where the words are keys - O(n) => O(1) - you can check if possible words are in the dictionary by checking dictionary[possibleWord]

  3. You could convert your array of dictionary words into a binary search tree or a trie - there might be a performance benefit to converting your source text to a collection of BSTs/Tries, where each one represents a possible word/permutation, and then comparing BSTs/Tries rather than strings, but I'm not sure how that'd be faster than string comparison at the moment.

You can limit the length to the max length of a given permutation to the words in your dictionary. You'll end up with a lot of permutations, but possibly less than you have currently.

As the comments state you may want to do this server side for more power/in a language more efficient than JS, or using WASM.

Some javascript libraries that have binary search tree tools:

  1. Alternatively, you might be able to create two hashes (one of permutations, one of dictionary words), or another data structure that's made for creating a "diff" or "overlap", and extract the keys that are in both sets.
Max Hudson
  • 9,961
  • 14
  • 57
  • 107
  • maybe I'm not understanding, but It's taking O(1) to check if a word is in my dictionary, my dictionary is a Set, that's not the slow down. What's taking a long time is checking each substring of the input string – danikitty Jan 24 '22 at 23:41
  • My code words even when the input does not start or end with a valid word, not sure I follow your first point – danikitty Jan 24 '22 at 23:48
  • To point 2 - already done, see line 6 of the code – danikitty Jan 24 '22 at 23:49
  • Where my issue stands is generating the permutations of the input string. currently dropping the first letter and recursively calling the function, dropping the last letter and recursively calling the function. – danikitty Jan 24 '22 at 23:52
  • 1
    @danikitty Max's 3rd suggestion gets my vote. When storing the dictionary, the benefit of a [trie](https://en.wikipedia.org/wiki/Trie) over a set is that while parsing the input string, you'll know when to stop, whereas with a set this isn't possible. Your second code example shows the inner loop having to go the full length of the input due to the limitation of the set both 1) not linking you to the next possible extensions and 2) not letting you know whether at a terminal node... – Trentium Jan 25 '22 at 01:57