0

Please look at this JSFiddle: https://jsfiddle.net/nu69kxyq/

This JS takes in input text file consisting of one word each line. The JS function finds longest and second longest COMPOUND word (made up of only other words in file) and total number of these compound words.

This is the exact input I'm using:

cat
cats
catsdogcats
dog
dogcatsdog
hippopotamuses
rat
ratcatdogcat

The output should be:

ratcatdogcat <-- longest compound word (12 characters)
catsdogcats <-- second longest compound word (11 characters)
3 <-- total number of compound words in a file

Total number of compound words is 3 because catsdogcats, dogcatsdog, ratcatdogcat.

I first take in all words then sort it in wordsList. Then I create a hash table of words for reference (to check for compound words) in wordDict:

var wordsList = list.sort(function(a, b) {
  return a.length - b.length; //sort words by length ascending
});

var wordDict = list.reduce(function(words, value, index) { //hash table from text file data
  words[value] = true;
  return words;
},{});

var isConcat = function(word) {
  for (var i = word.length; i > -1; i--){
    var prefix = word.slice(0,i);
    var suffix = word.slice(i, word.length);
    if (wordDict[prefix] === true){ //????? THIS IS ALWAYS FALSE EVEN WHEN THE KEY'S VALUE IS TRUE!!!
      if (suffix.length === 0) {
        return true; //all suffix are prefixes. word is concatenated.
      }
    return isConcat(suffix); //continue breaking remaining suffix into prefix if possible
    }
  }
  return false;
};

The problem is in isConcat, when I check against wordDict[prefix] to see if key's value is true, it is always false. Even when it should be true! I tried stepping through code and when key value is 'true' for 'word' key, the if (wordDict[prefix] === true) statement still doesn't execute because it thinks it's false. Using data.split("\n");, I'm able to read text file and put it in array. No issues there. So what is going wrong?

Note: I've tried using var list = data.match(/\w+/g); instead of var list = data.split("\n"); to match all alphanumeric characters as words, instead of splitting them by new line, and the function worked (wordDict[prefix] worked as expected). BUT this regex skips some words when I pass in a text file of over 150,000 text words. I think I need to use data.split("\n"). What is going wrong here?

RJK
  • 226
  • 1
  • 6
  • 22
  • Properties of objects in JavaScript cannot really be looked up by prefix, if that's what you are doing. For example `obj["foo"]` will not match `obj["foobar"]`. I've not looked through the entire code yet, but I noticed you were doing `obj[prefix]` – VLAZ Sep 20 '16 at 23:47
  • No, the code is splitting the word in two at every character and checking, so that's not the problem – qxz Sep 20 '16 at 23:50
  • @RJK Have you put some `console.log`s in to see what the values of `wordDict`, `prefix`, and `suffix` are? – qxz Sep 20 '16 at 23:51
  • @qxz I haven't specifically used `console.log` but I have stepped through the browser console debugger to determine that when `word[prefix]` is like this: `dog : true` it still returns false. What would you suggest in this situation? – RJK Sep 20 '16 at 23:53
  • @qxz `.match` does return all, as long as you have the global flag and there are multiple matches: `"abacad".match(/a/g) //[ "a", "a", "a" ]` – VLAZ Sep 20 '16 at 23:54
  • My bad, didn't read carefully – qxz Sep 20 '16 at 23:56
  • @vlaz for some reason, even with the global flag in my situation, it still failed to return some words in a text file of over 150,000 words. perhaps regex isnt the way to go here? – RJK Sep 20 '16 at 23:58
  • @qxz the long file i'm using must not be written in a way where it can just return all. how can i do this using `data.split("\n")`? – RJK Sep 21 '16 at 00:01
  • Are you on Windows? is it reading the file with `\r\n` line breaks? (grrr) – qxz Sep 21 '16 at 00:02
  • Wait, if you are splitting by `\n` is the file, by any chance, encoded in the Windows format - `\r\n`? – VLAZ Sep 21 '16 at 00:02
  • I'm on Windows yes. I'm not 100% sure on the line breaks...should I try with `\r\n`? – RJK Sep 21 '16 at 00:03
  • Dear God, I had this exact same problem and it took me so long – qxz Sep 21 '16 at 00:03
  • Give it a go. If you open the file in Notepad++ it should tell you which line endings is it using. – VLAZ Sep 21 '16 at 00:03
  • Try doing `data = data.replace(/\r\n/g, "\n");` to normalize it – qxz Sep 21 '16 at 00:04
  • It's in the bottom bar near the right corner - to the left of the encoding (likely UTF-8) it will say `Dos\Windows` if it uses `\r\n` or `UNIX` for `\n`. If you double-click on that value, you can switch to another, e.g., switch from Windows to UNIX – VLAZ Sep 21 '16 at 00:06
  • To quickly check, you can take a small file and do `console.log(JSON.stringify(fileContents));` – qxz Sep 21 '16 at 00:06
  • Also, another quick way to check, using Notepad++ - open search using Ctrl+F, switch the Search Mode to Extended and search for `\r` - if you find anything, then it's using that in the line ending – VLAZ Sep 21 '16 at 00:08
  • @qxz i tried using your replace and it seems to work for the input above, but it gives me `RangeError: Maximum call stack size exceeded(…)` when I use a large file! ughhhh – RJK Sep 21 '16 at 00:08
  • @vlaz i think your guys' advice works! it just gives me an error because i guess the array is too big :( – RJK Sep 21 '16 at 00:09
  • 1
    That's because your `isConcat` function is recursive. You should rewrite it to be iterative instead, e.g. by replacing the word with the suffix and looping – qxz Sep 21 '16 at 00:13
  • 1
    Ha! that's a problem with too much recursion. There are techniques to work around that, have a look [here](http://www.integralist.co.uk/posts/js-recursion.html) also, technically, you can move away from hogging the stack by using `setTimeout` (although you need to restructure how exactly you use that). [I wrote about that recently](http://stackoverflow.com/questions/39459236/understanding-event-queue-and-call-stack-in-javascript/39459913#39459913) although it's more of an intro level, it might help. – VLAZ Sep 21 '16 at 00:14
  • how can i give you guys reputation? thank you so much for your help – RJK Sep 21 '16 at 00:18
  • @qxz what did you mean specifically by replacing the word with suffix? – RJK Sep 21 '16 at 00:20
  • See my answer. I tested it and it works the same as yours. – qxz Sep 21 '16 at 00:58

1 Answers1

1

Here is a working implementation of isConcat that isn't recursive, and therefore won't have problems with stack overflow errors:

function isConcat(word) {
  for (var i = word.length; i >= 0; i--) {
    var prefix = word.slice(0, i);
    var suffix = word.slice(i);
    if (wordDict[prefix]) {
      if (suffix.length === 0) {
        return true; // all suffix are prefixes. word is concatenated.
      } else {
        // "restart" with suffix as the word
        word = suffix;
        i = word.length+1;
      }
    }
  }
  return false;
}

It essentially does the same thing, except it loops/restarts instead of recursing. Note that this will match not only compound words, but also words that exactly match a word from the list (a compound of just one word).

qxz
  • 3,814
  • 1
  • 14
  • 29
  • I see. When I try to test in browser, it seems to hang and page becomes unresponsive. Do you think this is because we're matching exactly a word as well? – RJK Sep 21 '16 at 01:08
  • can we do something like `if(word starts with suffix)` instead of `word = suffix` – RJK Sep 21 '16 at 01:15
  • If you're using a large enough input, it might "hang" just from the processing time. You can use [`string.startsWith(prefix)`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/startsWith) to check if a string begins with another. – qxz Sep 21 '16 at 01:19
  • I just realized that just having `cat` and `cats` in the word list complicates the algorithm, because if it matches `cat` in `catsdog`, then it'll fail the next step because `sdog` isn't in the list – qxz Sep 21 '16 at 01:23