2

I've got a long array of objects (>10_000) with duplicate objects I would like to remove.

In order to locate a duplicate I've got to look at two object properties: a, b

There are some elegant answers for removing objects by one property here: JavaScript: Remove duplicates of objects sharing same property value

e.g.

const uniq = _.uniq(arr, ele => ele.value}); 

Here is what the output of a solution would look like:

const arr = [{a:1, b:1}, {a:1, b:1}, {a:2, b:2}];
const removeDuplcatesByTwoKeys = (arr, ['a', 'b']) => // only elements that are duplicates for both key values;
result: const arr = [{a:2, b:2}];

I've tried _.uniq(arr, ele => ele.value && ele.otherValue}); but this does not work.

Another approach would be creating a map of the existing values keyed by those values e.g.

function unique(arr, keyProps) {
    let map = new Map();
    const kvArray = arr.map(entry => {
        return keyProps.map(k => entry[k]).join('|');
    })
    kvArray.map(kv => {
        if(map.has(kv)) {
            const val = map.get(kv)
            map.set(kv, val + 1)
        } else {
            map.set(kv, 1)
        }
    })
}

Though this would tell you what the duplicates are, what is the best way to remove them from the original array? This feels like a solution that is more complicated than it needs to be.

What is a performant way to remove duplicates by two properties from an array of objects?

zero_cool
  • 3,960
  • 5
  • 39
  • 54
  • The suggested solution will work for an array of integers, but this is an array of objects, and the duplicates need to be removed based on two properties being equal between objects in the array. – zero_cool Jul 12 '18 at 21:30
  • 1
    Where do you get this array from? Because the typical solution in cases like these is "don't use an array, use a data structure that is designed to make this a quick and performant operation", so some kind of list manager with B-trees for the properties you're going to be filtering on, so you can filter on insert, as well as post-process. – Mike 'Pomax' Kamermans Jul 12 '18 at 21:30
  • @Mike'Pomax'Kamermans - this is data that comes from a user uploaded CSV file. The API expects an array of objects. I suspect there is a way to achieve this without changing the API. – zero_cool Jul 12 '18 at 21:32
  • 1
    Possible duplicate of [Remove duplicates from an array of objects in JavaScript](https://stackoverflow.com/questions/2218999/remove-duplicates-from-an-array-of-objects-in-javascript) – Heretic Monkey Jul 12 '18 at 21:39

3 Answers3

2

You could use _.uniq with both properties as a JSON string. This way each element can be compared with the others through a uniform system.

For example,

const arr = [{a:1, b:1}, {a:1, b:1}, {a:2, b:2}];
const removeDuplcatesByTwoKeys = _.uniq(arr, el => JSON.stringify({a: el.a, b: el.b}));

console.log(removeDuplcatesByTwoKeys)
<script src="https://cdnjs.cloudflare.com/ajax/libs/underscore.js/1.9.1/underscore-min.js"></script>
Marco
  • 2,004
  • 15
  • 24
  • What are potential pitfalls of this approach? I think the data is uniform so key order should not matter. – zero_cool Jul 12 '18 at 21:35
  • 1
    @zero_cool The only pitfall I can think of is if the key order is different between elements, but other than that I think this is a simple way of accomplishing your goal. – Marco Jul 12 '18 at 21:39
0

This can also be solved via:

   const removeDuplcatesByTwoKeys = array.filter((val, index) => {
     return array.findIndex((row) => (
       row.a === val.a && row.b === val.b
     ))
   })

I've read that findIndex is not performant with large arrays but not 100% on that. This method would allow you to check against as many keys as needed and not be concerned with order.

zero_cool
  • 3,960
  • 5
  • 39
  • 54
  • How findIndex works behind the scenes is that it walks through the array until the function that you've provided returns true. It is basically a for loop rendering it a O(n) function. This solution works perfectly well but will slow down as your dataset grows as it is a O(n^2) solution – Nick Abbott Jul 12 '18 at 22:04
0

One issue with arrays is the O(n) lookup time. Big O Notiation, There is simply no way around that. My first suggestion here would be to look into other methods of storing the data with a O(1) lookup time. In JavaScript your go-to solutions would be using a Map, a Set, or a simple JavaScript Object. Your choice here really depends on your needs.

A Map is a Key Value Pair system. Thus will allow you to set and get a value by a key. This is very similar to a JavaScript object. The primary differences are that a Map is ordered so it may be iterated upon with a guarantee that the result will be ordered by insertion time. Also, a Map's key may be any datatype whereas a JavaScript object may only have a string.

A Set is basically a O(1) lookup array. The limitation here is that you cannot have duplicate values though it is still ordered by insertion time.

If you don't have control over how you receive the data this actually become quite a common interview question. While solving this problem is easy the true challenge comes in solving it in a performant way. The general accepted solution is O(n). You will simply iterate over the array and add either the value or an identifying feature to a Set. When you come across a value that is already in the set you may skip it. At the end of one iteration through the array you will have all unique values. There is simply no way for a catch-all algorithm to solve this problem faster.

For your particular issue I may suggest using a map so that you can use the stringified value of the object as a key. You may also use a set and just parse the JSON when you wish to use the object. A third, and probably ideal solution, is possible if an object contains a unique value such as an id. In this case you may just use this id as the key in the array. This would prevent issues with object property ordering.

const arr = [{a:1, b:1}, {a:1, b:1}, {a:2, b:2}];
const map = new Map();

arr.forEach((val) => {
  const stringified = JSON.stringify(val);
  if (!map.has(stringified)) {
    map.set(stringified, val);
  }
});

console.log(map.values()); // MapIterator { { a: 1, b: 1 }, { a: 2, b: 2 } }

I would hesitate to use this solution in browsers as I'm not sure of the adoption of recent features such as maps and sets however in node.js this would be the most performant way of doing it.

Nick Abbott
  • 364
  • 1
  • 9