
I am doing a modified version of collecting word co-occurrences, so I wrote my own JavaScript, and I am tracking the occurrences in three objects. However, once the objects get large (~8 million, 3 million, and 172,000 keys) a function that took 5 seconds per 100,000 sentences now takes minutes to process one sentence with 30 words (30 tokens). I am nowhere near my RAM cap (I have 12 more GB of RAM it could be using, and the program is only using 2.2GB). Using Node.js v17.3.1.

Why does my function take so long when the objects get bigger (even though the sentences remain the same length)? Should I be using a different data structure besides JavaScript's default object, or is there a way to improve the speed of getting and setting keys on these objects when they are so big?

Code:

  let posCounts = {};
  let negCounts = {};
  // the number of times each word occurs
  let wordCounts = {};
  let actual_tok;
  let tokens = // some function that gets tokens;

  for (let k = 0; k < tokens.length; k++) {
    // count word occurrences
    if (tokens[k] in wordCounts) {
      wordCounts[tokens[k]] += 1;
    } else {
      wordCounts[tokens[k]] = 1;
    }
    for (let tok = k + 1; tok < tokens.length; tok++) {
      if (tok == k) {
        // avoid word to self cooccurrence
        // should no longer be possible
        continue;
      } else {
        // check which form of the cooccurrence exists already in either count
        actual_tok = (tokens[k] + "-" + tokens[tok]);
        if (actual_tok in posCounts || actual_tok in negCounts) {
          // no-op
        } else {
          actual_tok = (tokens[tok] + "-" + tokens[k]);
        }

        // condition set before this block of code
        if (condition) {
          if (actual_tok in posCounts) {
            posCounts[actual_tok] += 1;
          } else {
            posCounts[actual_tok] = 1;
          }
        } else {
          if (actual_tok in negCounts) {
            negCounts[actual_tok] += 1;
          } else {
            negCounts[actual_tok] = 1;
          }
        }
      }
    }
  }

Update: I've tried increasing the heap size via `node train_matrices.js --max-old-space-size=12288` and `node train_matrices.js --max_old_space_size=12288` (underscores instead of dashes), and that didn't work either.

AndersonHappens
  • `and the program is only using 2.2GB).` You are approaching the default max heap size, so you're likely triggering a lot of GC pauses. You can try launching your node program with the node flag `--max_old_space_size=4096`, which will expand the RAM node is willing to use. They don't recommend exceeding 2GB though (even though some of my production servers have background node processes that use >= 6GB sometimes without much issue) – Norman Breau Feb 09 '22 at 06:32
  • Perhaps a `Map` object would be more appropriate. I have read that it handles dynamically inserted keys better than a plain object. – jfriend00 Feb 09 '22 at 06:32

2 Answers


Probably not the main issue in your code, but you can reduce the number of lookups by changing this structure:

  if (tokens[k] in wordCounts) {
    wordCounts[tokens[k]] += 1;
  } else {
    wordCounts[tokens[k]] = 1;
  }

to this:

  let token = tokens[k];
  let cnt = wordCounts[token] || 0;
  wordCounts[token] = cnt + 1;

And, as I said in a comment, I've read that a `Map` object with `.get()` and `.set()` is better suited when there are lots of dynamically created keys. Plain objects are better suited when you have lots of objects that all share the same keys (the JS engine can sometimes compile them down to a C-like struct), but that optimization can't be applied when you're regularly adding new keys.
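
For illustration, the word-counting part of your loop written against a `Map` might look something like this (just a sketch, reusing the `tokens` and `wordCounts` names from the question):

  const wordCounts = new Map();
  for (let k = 0; k < tokens.length; k++) {
    const token = tokens[k];
    // one get plus one set per token, no separate `in`/`has` check needed
    wordCounts.set(token, (wordCounts.get(token) || 0) + 1);
  }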

jfriend00

The answer was to both use the increase-memory flag (`node <YOUR_FILE_NAME>.js --max-old-space-size=12288`) and to change from plain objects to using a `Map` - thanks to @jfriend00 and @Norman Breau for the suggestions. That said, a `Map` has a max capacity of 2^24 entries or 1 GB, so I ended up using a modified version of the BigMap from this Stack Overflow (modified so that it still limits the total number of items, because otherwise I ended up running completely out of RAM).
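
The idea behind BigMap is to shard keys across several underlying `Map`s once one shard fills up. A minimal sketch of that approach (hypothetical, not the exact implementation from the linked answer, and without the item cap I added) could look like:

  class BigMap {
    constructor() {
      this.maps = [new Map()];
    }
    set(key, value) {
      // update the shard that already holds the key, if any
      for (const m of this.maps) {
        if (m.has(key)) {
          m.set(key, value);
          return this;
        }
      }
      let last = this.maps[this.maps.length - 1];
      // open a new shard before hitting V8's per-Map entry limit (~2^24)
      if (last.size >= 16777000) {
        last = new Map();
        this.maps.push(last);
      }
      last.set(key, value);
      return this;
    }
    get(key) {
      for (const m of this.maps) {
        if (m.has(key)) return m.get(key);
      }
      return undefined;
    }
    has(key) {
      return this.maps.some(m => m.has(key));
    }
  }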

Modified code (you can replace BigMap with Map if you want):

  let posCounts = new BigMap();
  let negCounts = new BigMap();
  let wordCounts = new BigMap();
  let actual_tok;
  let tokens = // some code

  // mark every cooccurrence
  for (let k = 0; k < tokens.length; k++) {
    // count word occurrences
    if (wordCounts.has(tokens[k])) {
      wordCounts.set(tokens[k], wordCounts.get(tokens[k]) + 1);
    } else {
      wordCounts.set(tokens[k], 1);
    }
    for (let tok = k + 1; tok < tokens.length; tok++) {
      if (tok == k) {
        // avoid word to self cooccurrence
        // should no longer be possible
        continue;
      } else {
        // check which form of the cooccurrence exists already in either count
        actual_tok = (tokens[k] + "-" + tokens[tok]);
        if (posCounts.has(actual_tok) || negCounts.has(actual_tok)) {
          // no-op
        } else {
          actual_tok = (tokens[tok] + "-" + tokens[k]);
        }

        if (condition) {
          if (posCounts.has(actual_tok)) {
            posCounts.set(actual_tok, posCounts.get(actual_tok) + 1);
          } else {
            posCounts.set(actual_tok, 1);
          }
        } else {
          if (negCounts.has(actual_tok)) {
            negCounts.set(actual_tok, negCounts.get(actual_tok) + 1);
          } else {
            negCounts.set(actual_tok, 1);
          }
        }
      }
    }
  }
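
As a side note, the repeated has/get/set blocks could be collapsed with a small helper (a sketch; `bump` is a hypothetical name, and it assumes the counters expose the same `get`/`set` API as `Map`):

  // Hypothetical helper: increment a counter key in a Map-like store.
  function bump(counts, key) {
    counts.set(key, (counts.get(key) || 0) + 1);
  }

  // e.g. bump(wordCounts, tokens[k]);
  //      bump(condition ? posCounts : negCounts, actual_tok);
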
AndersonHappens