
I have a very large array of arrays (on the order of 960,799 entries or possibly much larger). I need to process it into a new array such that:

  1. Each sub-array contains no duplicates.
  2. The main array contains no duplicate sub-arrays.

The problem is that "duplicate sub-arrays" must include arrays with the same values in a different order. In other words, if I had these sub-arrays:

[[1,2,3], [1,2,3], [3,1,2]]

They would all be considered duplicates, and only one would be kept. It doesn't matter which; I've just been keeping the first one, and it's also fine if the element order of the kept sub-array changes during processing.

My attempted solution has been to map each sub-array to a string by de-duping it, sorting it, and joining it with a delimiter. Then I de-dupe that array of strings and map the strings back to arrays with a split. It works, but the process is extremely slow: it takes over 30 seconds for a single pass, and since the array I end up processing can grow exponentially larger, this is not acceptable. I need a more efficient algorithm.

Here's the code I'm using now that's slow (ret is the input array):

// Map each sub-array to a canonical string: de-dupe with a Set, sort, join with a delimiter
const stringList = ret.map(list => {
    return [...new Set(list)].sort().join('|');
});
// De-dupe the strings with a Set, then split each one back into an array
const hashSet = new Set(stringList);
const output = [...hashSet].map(str => str.split('|'));

Can anyone help me get the same result more efficiently? Thanks.

EDIT

To elaborate, I'm getting these massive input arrays by calculating what is essentially the power set of an input array of strings. This is the code; if it's possible to stop it from producing duplicate entries in the first place, that would work well, too:

// Calculate the Cartesian product of the arrays in s
function cart(s) {
    return s.reduce((acc, val) => {
        return acc.map(x => {
            return val.map(y => {
                return x.concat([y]);
            });
        }).flat();
    }, [[]]);
}

// Use the Cartesian product to calculate the power set of set s
function pset(s) {
    let ret = [];
    for (let i = 0; i < s.length; ++i) {
        const temp = [];
        for (let j = 0; j <= i; ++j) {
            temp.push([].concat(s));
        }
        ret = ret.concat(cart(temp));
    }
    return ret;
}
IceMetalPunk
  • I realize your question is about deduplication, but how are your arrays and subarrays being generated in the first place? In regards to efficiency, it would be better to generate them without duplicates in the first place if possible. – Patrick Roberts Sep 17 '19 at 20:38
  • Take a look at this answer for some ideas https://stackoverflow.com/a/9229821/79790 – bpaul Sep 17 '19 at 20:42
  • @PatrickRoberts They are calculated as basically the power set of an input array of values. Basically, I have a function that calculates the Cartesian product of an array of arrays. I'm running that function on increasingly longer sets of the same input array duplicated 1 to N times (N=length of the input) and concatenating the results together. This does result in a power set, but now I need to de-dupe that result, as the products from smaller iterations overlap with products from larger ones. I'll see if I can modify the power set code to prevent duplicates in the first place. – IceMetalPunk Sep 17 '19 at 20:45
  • @PatrickRoberts I've added my power set calculation code into the OP. I tried modifying it to prevent duplicates from generating, and instead I ended up removing legitimate unique values from the final output; I'm not sure quite how to make this work from the generation side. – IceMetalPunk Sep 17 '19 at 20:53
  • do you have some examples for both functions? – Nina Scholz Sep 17 '19 at 20:53
  • @NinaScholz The code is now in the OP. The outputs are usually large arrays, so examples on input/output are difficult, but for a small one: `pset(['A', 'B'])` produces `[['A'], ['B'], ['A', 'A'], ['B', 'B'], ['A', 'B'], ['B', 'A']]` but I need it without the duplicates, so something more like `[['A'], ['B'], ['A', 'B']]` – IceMetalPunk Sep 17 '19 at 20:57
  • Having a Set available during your calculations to check for dups would definitely be more efficient than going through the whole array again (see the sketch after this comment thread). – charlietfl Sep 17 '19 at 20:57
  • Also, do your sub arrays contain strings or numbers? You've been fairly inconsistent in exemplifying your dataset, but this is a rather critical point from a performance perspective. – Patrick Roberts Sep 17 '19 at 21:04
  • If your original problem is just to compute the powerset, how's your question different from, say, https://stackoverflow.com/questions/42773836/how-to-find-all-subsets-of-a-set-in-javascript – slider Sep 17 '19 at 21:05
  • You could potentially do some dynamic programming and calculate the summation of each subset, then only check for duplicates in the subsets that have equal summations. This obviously assumes that you're always working with numeric values. – Jake Holzinger Sep 17 '19 at 21:25
  • @PatrickRoberts It could be either; the type will not necessarily be known beforehand. – IceMetalPunk Sep 18 '19 at 14:07
  • @slider I... suppose it's not. I was under the impression that a true power set would include those duplicate permutations such as [1,2] and [2,1]. From the output of the solutions on that question, it seems I misunderstood. I guess this can be marked as a duplicate, then. – IceMetalPunk Sep 18 '19 at 14:08
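
As a rough sketch of charlietfl's suggestion above (not code from the thread): keep a Set of canonical keys while the Cartesian products are generated, so a duplicate can be dropped the moment it is produced rather than in a separate full pass afterwards. It assumes the cart helper from the question; the psetDeduped name and the '|' key format are purely illustrative.

// Hypothetical variant of the question's pset(): checks a Set of canonical keys
// as each Cartesian product is produced, instead of deduplicating afterwards.
function psetDeduped(s) {
    const seen = new Set();
    const output = [];
    for (let i = 0; i < s.length; ++i) {
        const temp = [];
        for (let j = 0; j <= i; ++j) {
            temp.push([].concat(s));
        }
        for (const combo of cart(temp)) {
            const deduped = [...new Set(combo)];       // requirement 1: no duplicates inside a sub-array
            const key = [...deduped].sort().join('|'); // order-insensitive key, as in the question
            if (!seen.has(key)) {                      // requirement 2: no duplicate sub-arrays overall
                seen.add(key);
                output.push(deduped);
            }
        }
    }
    return output;
}

// e.g. psetDeduped(['A', 'B']) yields something like [['A'], ['B'], ['A', 'B']]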

3 Answers


You could generate the power set without duplicates.

function pset(array) {
    function iter(index, temp) {
        if (index >= array.length) {
            // keep only non-empty subsets
            temp.length && result.push(temp);
            return;
        }
        iter(index + 1, temp.concat(array[index])); // branch that includes array[index]
        iter(index + 1, temp);                      // branch that excludes array[index]
    }
    var result = [];
    iter(0, []);
    return result;
}

console.log(pset(['a', 'b', 'c']));
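For reference, a quick check of this recursive approach against the small example from the question's comments (assuming subset order doesn't matter, which the asker said is fine):

pset(['A', 'B']);
// → [['A', 'B'], ['A'], ['B']], i.e. each subset exactly once, with no permutation duplicates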
Nina Scholz
  • While this works, it takes exactly as long as my original solution; it doesn't seem to be any faster/more efficient :( – IceMetalPunk Sep 17 '19 at 20:40
  • @IceMetalPunk: this one is way faster and gives the correct output you want. For example, doing `pset(['A', 'B', 'C', 'D', 'E', 'F'])` 100 times, this one runs in 2.7ms; yours runs in 3115ms. I can even get it slightly faster by caching `array.length` – dandavis Sep 17 '19 at 22:08

Given that I'm not able to perform a benchmark with real data, I can't verify how much faster this approach is for your use case, but by using basic for loops and avoiding functional code as much as conveniently possible, I've come up with the following:

const ret = [[1, 2, 3], [1, 2, 3], [3, 1, 2], [1, 4, 5], [4, 1, 5]];

function ascending (a, b) {
  // works for strings and numbers
  return -(a < b) || +(a > b);
}

function ascending2d (a, b) {
  const aLength = a.length;
  const bLength = b.length;
  const length = Math.min(aLength, bLength);

  for (let i = 0; i < length; ++i) {
    const difference = ascending(a[i], b[i]);
    if (difference !== 0) return difference;
  }

  return aLength - bLength;
}

// Sort each sub-array so that element order no longer matters
for (let i = 0; i < ret.length; ++i) {
  ret[i].sort(ascending);
}

// Sort the outer array so that duplicate sub-arrays become adjacent
ret.sort(ascending2d);

const output = [ret[0]];

// Keep each sub-array only if it differs from its predecessor
for (let i = 1; i < ret.length; ++i) {
  const value = ret[i];
  if (ascending2d(ret[i - 1], value) !== 0) output.push(value);
}

console.log(output);

Let me know if this is an improvement over your current approach. You can always improve performance further by profiling your code and looking for bottlenecks that can be re-written.

Performance Benchmark

I've published a benchmark using the test data in my example here, comparing your original solution, my solution, and Andrew's solution. I couldn't include Nina's for comparison because hers doesn't perform deduplication on ret; instead, it modifies the generation of ret.

Patrick Roberts

EDIT: Never mind; I hadn't benchmarked my implementation, and it turns out to be slower, due to the overhead of JSON.parse, JSON.stringify, and the default comparator used by Array#sort.

Since you're looking for bleeding-edge performance, it's hard to get an elegant solution. If you instantiate an object with Object.create(null), you minimize the overhead of O(1) insertion: it creates a plain object with no prototype, so you also don't need an Object.hasOwnProperty check inside the for...in loop, because there's no prototype chain to search.

const ret = [[], [1, 2, 3], [3, 1, 2], [1, 4, 5], [4, 1, 5]];

function createUniqArraysOfPrimitiveArrays(ret) {
  // prototype-less object used as a hash map of canonical JSON keys
  const hashMap = Object.create(null)
  for (let i = 0; i < ret.length; i++) {
    const currEl = ret[i]
    if (currEl.length === 0) {
      hashMap['[]'] = null
    } else if (currEl.length === 1) {
      // no sort needed; JSON.stringify keeps the key parseable even for string elements
      hashMap[JSON.stringify(currEl)] = null
    } else {
      hashMap[JSON.stringify(currEl.sort())] = null
    }
  }
  const outputArray = []
  for (const array in hashMap) {
    outputArray.push(JSON.parse(array))
  }
  return outputArray
}

console.log(createUniqArraysOfPrimitiveArrays(ret))
Andrew
  • I don't necessarily think a solution containing `JSON.stringify()` and `JSON.parse()` screams "performance"... in addition, relying on the default sorting method will decrease performance in comparison to using a callback that simply returns the difference of the elements, because [the default sort order is built upon converting the elements into strings, then comparing their sequences of UTF-16 code units values](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort) (see the short illustration after this thread). – Patrick Roberts Sep 18 '19 at 14:56
  • I've confirmed my suspicions. [This benchmark](https://jsperf.com/deduplication-of-2d-array/1) suggests that your approach is actually _slower_ than the original solution, not faster (although on Edge it does seem to perform faster); in either case, my solution performs much faster than both. – Patrick Roberts Sep 18 '19 at 15:33
  • @PatrickRoberts Interesting information. I'll take a closer look at your answer – Andrew Sep 18 '19 at 16:40
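
To make the point above about the default comparator concrete, here is a small illustration (not from the original thread): the default sort converts elements to strings before comparing them, so numbers end up in lexicographic rather than numeric order, and the extra conversion costs time.

// Default sort compares elements as strings, so 10 sorts before 2
console.log([10, 2, 1].sort());                 // [1, 10, 2] (compared as the strings "1", "10", "2")
// A numeric comparator avoids the string conversion and gives numeric order
console.log([10, 2, 1].sort((a, b) => a - b));  // [1, 2, 10]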