What is the fastest way to remove duplicates so that every UserId is unique? There are around 30 million user IDs to check.

Usage

const userIds = {}

const transform = csv.format({ headers: false }).transform((row) => {
  if (userIds[row.user_id]) {
    console.log(`Found Duplicate ${row.user_id}`);
    return false;
  } else {
    userIds[row.user_id] = 1;
  }

  return row;
});

The problem is that the script hangs after about 20 minutes. I am running the script from the CLI.

I'll-Be-Back
  • It might be more performant to use a [Set](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Set) instead. Otherwise it isn't clear what `csv` is (an NPM package?) and whether it's meant to be run on millions of rows. – Kelvin Schoofs Jul 21 '21 at 17:28
  • If you don't care which IDs are duplicated, then your current solution looks really good already. I'd remove the `console.log`, as that can *drastically* slow down execution time if it's being called a lot; logs are very slow compared to the other operations in your `transform`. – zcoop98 Jul 21 '21 at 17:29
  • Via [this answer](https://stackoverflow.com/questions/1960473/get-all-unique-values-in-a-javascript-array-remove-duplicates/43046408#43046408) to [Get all unique values in a JavaScript array (remove duplicates)](https://stackoverflow.com/q/1960473), Sets are likely a good bet performance-wise too. – zcoop98 Jul 21 '21 at 17:31
  • Just out of curiosity, how long does it take to iterate over each row _without_ checking the object for dupes? – silencedogood Jul 21 '21 at 17:39
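
As a rough sketch of the `console.log` suggestion in the comments above, duplicates can be counted and reported once at the end instead of logged one by one. The names `filterRow`, `seenIds`, and `duplicateCount` are hypothetical, and the sample rows stand in for the real CSV stream.

```javascript
// Track IDs already seen and how many duplicate rows were dropped.
const seenIds = new Set();
let duplicateCount = 0;

// Hypothetical row filter: returns the row the first time its user_id
// appears, and false for every later occurrence.
function filterRow(row) {
  if (seenIds.has(row.user_id)) {
    duplicateCount += 1; // count instead of logging each hit
    return false;
  }
  seenIds.add(row.user_id);
  return row;
}

// Small in-memory sample standing in for the CSV rows.
const sample = [
  { user_id: 'a' },
  { user_id: 'a' },
  { user_id: 'b' },
  { user_id: 'a' },
];
const unique = sample.map(filterRow).filter(Boolean);

// One summary line at the end rather than one log call per duplicate.
console.log(`Kept ${unique.length} rows, dropped ${duplicateCount} duplicates`);
```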

1 Answer

A Set's `has` lookup is hash-based and takes roughly constant time on average, which makes it much faster than an array's `includes` check, which scans the elements one by one.

const userIds = new Set()

const transform = csv.format({ headers: false }).transform((row) => {
  if (userIds.has(row.user_id)) {
    console.log(`Found Duplicate ${row.user_id}`);
    return false;
  } else {
    userIds.add(row.user_id);
  }

  return row;
});
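
To try the Set-based check without the `csv` package, here is a minimal standalone sketch. The helper `makeDedupeFilter` and its `keyName` parameter are hypothetical names introduced for illustration; the returned function follows the same contract as the transform callback above (return the row to keep it, `false` to drop it).

```javascript
// Hypothetical helper: builds a filter that remembers every key it has seen.
function makeDedupeFilter(keyName = 'user_id') {
  const seen = new Set();
  return (row) => {
    const key = row[keyName];
    if (seen.has(key)) {
      return false; // duplicate: drop the row
    }
    seen.add(key);
    return row; // first occurrence: keep the row
  };
}

// Small in-memory sample standing in for the CSV rows.
const dedupe = makeDedupeFilter();
const rows = [{ user_id: '1' }, { user_id: '2' }, { user_id: '1' }];
const kept = rows.map(dedupe).filter(Boolean);

console.log(kept.length); // 2
```

Keeping the Set inside a closure means each filter instance has its own state, so the same helper could be reused for several files without the IDs from one run leaking into the next.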
Charlie
  • Or just `userIds.add(row.user_id)` without if condition. – Aleksandr Smyshliaev Jul 21 '21 at 17:49
  • @AleksandrSmyshliaev This is a transform stream. It has to return the `row` or `false` . The `userIds` variable is only used for tracking the duplicates. – Charlie Jul 21 '21 at 17:51
  • One problem to be noted with this approach: [Sets have a limit of 2^24 items on some systems](https://stackoverflow.com/questions/58674238/did-my-javascript-run-out-of-asyncids-rangeerror-in-inspector-async-hook-js/60849367#60849367), including the V8 engine/Chrome (2^24 = 16,777,216, a little under 17 million items); 30 million items may not work, and [you'll get a `RangeError`](https://github.com/nodejs/node/issues/37320). – zcoop98 Jul 21 '21 at 18:04
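
One possible way around the per-Set size limit mentioned in the last comment is to spread the IDs over several smaller Sets. The sketch below is an assumption-laden illustration, not a tested solution for 30 million rows: `ShardedSet`, `shardFor`, and the shard count of 16 are hypothetical, keys are normalized to strings (so `1` and `'1'` count as the same ID), and it relies on the hash spreading keys evenly enough that no single shard reaches the cap. Total memory use is not reduced.

```javascript
// Hypothetical wrapper that splits entries across several Sets so that
// no single Set has to hold all of them.
class ShardedSet {
  constructor(shardCount = 16) {
    this.shards = Array.from({ length: shardCount }, () => new Set());
  }

  // Pick a shard from a simple 32-bit string hash of the key.
  shardFor(key) {
    const s = String(key);
    let h = 0;
    for (let i = 0; i < s.length; i++) {
      h = (Math.imul(h, 31) + s.charCodeAt(i)) | 0;
    }
    return this.shards[(h >>> 0) % this.shards.length];
  }

  has(key) {
    return this.shardFor(key).has(String(key));
  }

  add(key) {
    this.shardFor(key).add(String(key));
    return this;
  }

  // Total number of entries across all shards.
  get size() {
    return this.shards.reduce((total, shard) => total + shard.size, 0);
  }
}

const ids = new ShardedSet();
ids.add('a').add('b').add('a');
console.log(ids.size, ids.has('a'), ids.has('c'));
```

Because it exposes `has` and `add`, a `ShardedSet` could stand in for the `userIds` Set in the answer's transform without other changes.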