This is a non-trivial problem and an opportunity to learn about designing data structures. This is a long-format answer that aims to give you the tools to make these analyses on your own. It uses Brandon's well-intentioned answer as a basis for contrast. First we need to examine the properties of some of JavaScript's built-in structures.
performance of array
We will start with Array. If we have an array, input, and a known index, i, we can fetch the element in constant time: input[i]. This means neither the size of input nor i affects how long it takes to get the answer. This is denoted in Big O notation as O(1). It holds because an array is effectively a pointer to a position in memory and i is an offset; finding the element is a constant computation, position + i, which is always a simple addition.
const input = [ "foo", "bar", "cat", "dog", "hat", "ham" ]
const i = 4
const result = input[i]
console.log(result) // "hat"
If however the index is unknown and we wish to find an element by its value, we must scan the array one element at a time. This is known as linear complexity and denoted O(n) because the amount of time/space scales linearly with the size of the input, n. With an array of n = 3 elements, we only have to check up to 3 elements to find a match. With an array of n = 1000, we would have to check up to 1,000!
const input = [ "foo", "bar", "cat", "dog", "hat", "ham" ]
const query = "hat"
const result = input.findIndex(v => v == query)
console.log(result) // 4
Almost every Array operation scales with the size of the input array. These linear operations include entries, every, fill, filter, find, findIndex, flat, flatMap, for..of, forEach, from, includes, indexOf, join, keys, lastIndexOf, map, reduce, reduceRight, reverse, slice, some, toString, and values. An important-to-know property of linear operations is that when they are nested, they become quadratic, O(n²).
An easy way to see this is two nested for loops: O(n) * O(n) = O(n*n). If n = 1000, this could take 1,000,000 computations. The same applies to other linear operations, like includes nested inside filter -
const input = [ "foo", "bar", "cat", "dog", "hat", "ham" ]
const result = input.filter(v => v.includes("a")) // <- nested linear operations
console.log(result)
// [ "bar", "cat", "hat", "ham" ]
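To make the quadratic growth concrete, we can count the inner iterations ourselves (the counter below is illustrative, not part of any answer under discussion):

```javascript
// count how many inner iterations two nested linear loops perform
const countChecks = n => {
  const input = Array.from(Array(n), (_, i) => i)
  let checks = 0
  input.forEach(_ => input.forEach(_ => checks++)) // O(n) nested in O(n)
  return checks
}
console.log(countChecks(10))  // 100
console.log(countChecks(100)) // 10000
```

Doubling n quadruples the work, which is exactly what O(n²) predicts.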
performance of object
Next we will look at JavaScript's Objects. Like Arrays, if we have an object, input, and a known key, query, we can retrieve a value in constant time, O(1) -
const input = { a: 1, b: 2, c: 3 }
const query = "b"
const result = input[query]
console.log(result) // 2
Now say we have two objects, foo and bar, and wish to determine whether they are equal. Our function isEqual combines Object.keys, O(n), with every, O(n), and nests includes, O(n), alongside a property lookup, O(1). This is O(n + n * (n + 1)), which simplifies to O(n²). The includes check here is unnecessary, but we are following Brandon's example for comparison -
const foo = { a: 1, b: 3 }
const bar = { b: 3, a: 1 }
const isEqual = (x,y) =>
  Object.keys(x).every(k => Object.keys(y).includes(k) && y[k] === x[k])
console.log(isEqual(foo, bar)) // true
compound performance
For our final example, let's take an array of objects, input, and a query object, and try to find the query's index. We already determined isEqual is quadratic, O(n²). When we nest this operation inside the linear findIndex, we end up with O(n) * O(n²) = O(n³). If n = 1000, this would take 1,000,000,000 computations! As you can see, linear operations can compound in computational complexity very quickly -
const input = [ { a: 1, b: 1 }, { a: 1, b: 2 }, { a: 1, b: 3 } ]
const query = { b: 2, a: 1 }
function isEqual(x, y) {
  return Object.keys(x).every(k => Object.keys(y).includes(k) && y[k] === x[k])
}
const result = input.findIndex(v => isEqual(v, query))
console.log(result) // 1
Now let's analyze Brandon's answer. I don't want anyone to think I'm picking on him; his program is valid and returns the correct result. However, as the input size grows, it can grind your CPU to a halt. If this program were used in a loop or ran on every UI refresh, it could be a problem.
const allSources =
[source1,source2]
.flat(Infinity)
.filter((obj,_,a) =>
a.find(o =>
Object.keys(obj).length === Object.keys(o).length
&& Object.keys(obj).every(objKey =>
Object.keys(o).includes(objKey)
&& obj[objKey] === o[objKey])) === obj)
complexity | operations
-----------|------------------------------------
O(1)       | .length, ===, &&
O(n)       | flat, filter, keys, every, includes
computational complexity
O(n + n * (n * (n + 1 + 1 + n + 1 + n + n * (n + n + 1 + 1 + 1 + 1))))
= O(n + n * (n * (n + 1 + 1 + n + 1 + n + n * (n))))
= O(n + n * (n * (n + 1 + 1 + n + 1 + n + n²)))
= O(n + n * (n³))
= O(n⁴)
my first data structure
Above, the filter(...) algorithm has O(n⁴) complexity. For an input size of n = 1000, that is an estimated 1,000,000,000,000 (one trillion) operations! If we want a faster solution, we cannot rely on Array's linear operations, but JavaScript does not include a data type that does exactly what we need. Maybe you have never designed your own data structure before; maybe you didn't know that was a possibility. That's okay!
We will now look at Map, which has very similar properties to Object. The primary difference is that an object's keys must be strings (or symbols), whereas Map can associate any key type with any value. And like Array and Object, we can nest Maps to any number of levels, if desired. We will use get sequenced with get to look up a nested value in constant time. This is denoted O(1) + O(1) = O(2) and simplifies to O(1) -
const key1 = { a: "ay" } // <- any type for key1
const key2 = [ "bee" ] // <- any type for key2
const someValue = Symbol("ess") // <- any value
const m = new Map([[ key1, new Map([[ key2, someValue ]]) ]])
const result = m.get(key1).get(key2)
console.log(result === someValue) // true
In addition to get, Map offers has, which answers whether our particular map, m, has a particular key, k -
const key1 = { a: "ay" }
const m = new Map([[ key1, "foo" ]])
console.log(m.has(key1)) // true
console.log(m.has("bar")) // false
recmap
We can use the high-performance properties of Map to write our own Recursive Map type, recmap. Before we dive into the implementation, here's how we want it to work -
import { empty, get, set } from "./recmap.js"
const t = empty() // <- empty recmap
set(t, ["a", 1, "b", 2], "hello") // <- set a deep value
const result = get(t, ["a", 1, "b", 2]) // <- retrieve a deep value
console.log(result) // "hello"
We can visualize t above as a recursive sequence of nested Maps -
Map { "a" => Map { 1 => Map { "b" => Map { 2 => "hello" }}}}
The user does not need to understand Array, Object, or Map to use it effectively. This is known as data abstraction and it's an important property of writing effective data structures. We will build our data type so that the nested structure is automatically managed by our module's operations -
// recmap.js
const empty = () =>
new Map // <- representation of "empty" recmap
const has = (t, [ k, ...ks ]) => // <- recursive wrapper around Map.has
ks.length == 0
? t.has(k)
: t.has(k)
? has(t.get(k), ks)
: false
const set = (t, [ k, ...ks ], v) => // <- recursive wrapper around Map.set
ks.length == 0
? t.set(k, v)
: t.has(k)
? (set(t.get(k), ks, v), t)
: t.set(k, set(empty(), ks, v))
const get = (t, [ k, ...ks ]) => // <- recursive wrapper around Map.get
ks.length == 0
? t.get(k)
: t.has(k)
? get(t.get(k), ks)
: undefined
export { empty, has, set, get } // <- public module interface
Our recmap is extremely fast, able to filter a million values in less than one second. This is great, but there's one small problem: the purpose of this question is to compare objects to one another.
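As a rough sanity check of that speed claim, here is a minimal timing sketch using nested Maps directly (numbers vary by machine; the two-level key split is an arbitrary choice for illustration, not part of the module):

```javascript
// insert one million values under two-level key paths, nested-Map style
const t = new Map()
console.time("set 1,000,000")
for (let i = 0; i < 1_000_000; i++) {
  const k1 = i % 1000                    // first key in the path (arbitrary split)
  if (!t.has(k1)) t.set(k1, new Map())   // create the nested level on demand
  t.get(k1).set(i, i)                    // second key in the path
}
console.timeEnd("set 1,000,000") // well under a second on a typical machine
```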
my second data structure
Hang in there, we are almost done. We designed recmap around Map so that the user doesn't have to concern themselves with managing a nested structure. We will build unique around recmap so that the user need not be concerned with nested Maps or key sequences either. Instead we can simply create an empty collection and freely add as many values as we want -
import { empty, has, add, values } from "./unique.js"
const t = empty() // <- empty unique
add(t, { a:1, b:2, c:3 }) // <- any object
const result = has(t, { a:1, b:2, c:3 }) // <- has any object?
console.log(result) // true
It doesn't matter which order the input object's keys are in. We will always get the correct answer -
console.log(has(t, { b:2, c:3, a:1 })) // true
It doesn't matter if we try to add the same object with different key orderings. It will only be added once -
add(t, { b:2, c:3, a:1 })
add(t, { b:2, a:1, c:3 })
add(t, { c:3, a:1, b:2 })
console.log(Array.from(values(t)))
// [ { c:3, a:1, b:2 } ] <- only one distinct object is added
unique
This module is much easier to write thanks to our existing recmap. It is a simple wrapper around recmap.set and recmap.has, with the addition of a toKey function which takes care of sequencing the input object. This technique of building complex data structures out of simpler ones is one every functional programmer must master -
// unique.js
import * as recmap from "./recmap.js"
const add = (t, v) => // <- wrapper around recmap.set
recmap.set(t, toKey(v), v)
const has = (t, v) => // <- wrapper around recmap.has
recmap.has(t, toKey(v))
const toKey = v => // <- private helper, sequences an object
Object
.entries(v)
.sort((a,b) => a[0].localeCompare(b[0]))
.flat()
const empty =
recmap.empty // <- same empty representation as recmap
const values =
recmap.values // <- todo
export { add, empty, has, values } // <- public interface
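To see what the private toKey helper produces, here is its sequencing in isolation (the helper is copied verbatim from the module above):

```javascript
// sorted entries flattened into a single key path
const toKey = v =>
  Object.entries(v).sort((a,b) => a[0].localeCompare(b[0])).flat()

console.log(toKey({ b: 2, c: 3, a: 1 })) // [ "a", 1, "b", 2, "c", 3 ]
console.log(toKey({ c: 3, a: 1, b: 2 })) // [ "a", 1, "b", 2, "c", 3 ]
```

Because entries are sorted before flattening, every key ordering of the same object produces the same path.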
Above, notice we did not export toKey. This is a private function used only by our module; an implementation detail the user does not need to worry about. Also note we referenced recmap.values before defining it; we will implement that now -
// recmap.js (continued)
const empty = //...
const has = //...
const set = //...
const get = //...
function* values (t)
{ if (t instanceof Map)
for (const v of t.values())
yield *values(v)
else
    yield t.unwrap // <- assumes wrapped leaves; see "enhancement 2"
}
export
{ empty, has, set, get, values } // <- update export
benchmark
We've come a long way. It's now time to measure the difference between our unique data structure and the linear filter algorithm proposed in Brandon's answer. To compare the two, we will use randCoord, which generates random { x, y } objects -
// fixtures
const rand = x =>
Math.floor(Math.random() * x)
const randCoord = size =>
({ x: rand(size), y: rand(size) })
const input =
Array.from(Array(count), _ => randCoord(size))
The unique algorithm iterates through the input and adds each element to our collection -
console.time("unique")
const a = empty()
for (const x of input)
add(a, x)
console.timeEnd("unique")
console.log(`uniques: ${Array.from(values(a)).length}`)
The filter algorithm is a polyadic chain of filter -> find -> every -> includes, as described above -
console.time("filter")
const b = filter(input)
console.timeEnd("filter")
console.log(`uniques: ${b.length}`)
With a size of randCoord(10), we generate coordinates from (0,0) to (10,10), with a maximum possibility of 10 x 10 = 100 unique coordinates -
count   | uniques | algorithm | runtime (ms) | each (ms)
--------|---------|-----------|--------------|----------
100     | 64      | filter    | 2.26         | 0.023
1,000   | 100     | filter    | 11.30        | 0.011
10,000  | 100     | filter    | 96.73        | 0.010
100,000 | 100     | filter    | 802.52       | 0.008
In this case, the linear filter algorithm never has to check more than 100 elements to determine whether one is unique. So even with a high count of 100,000 elements, duplicates can be detected somewhat easily. The unique data structure, however, detects duplicates faster -
count   | uniques | algorithm | runtime (ms) | each (ms)
--------|---------|-----------|--------------|----------
100     | 64      | unique    | 9.68         | 0.097
1,000   | 100     | unique    | 3.88         | 0.004
10,000  | 100     | unique    | 12.16        | 0.001
100,000 | 100     | unique    | 106.31       | 0.001
Your real-world data uses { name, age } objects, and the number of unique combinations is probably much greater than 10 x 10. Next we'll see that as the number of possible uniques increases, the performance of filter hurts more and more. With a size of randCoord(100), we generate coordinates from (0,0) to (100,100), with a maximum possibility of 100 x 100 = 10,000 unique coordinates -
count   | uniques | algorithm | runtime (ms) | each (ms)
--------|---------|-----------|--------------|----------
100     | 98      | filter    | 2.52         | 0.025
1,000   | 946     | filter    | 48.63        | 0.049
10,000  | 6,340   | filter    | 3,105.00     | 0.310
100,000 | 10,000  | filter    | 67,460.00    | 0.675
Here filter starts to rear its head. At a modest 1,000 elements, the runtime is already 4x greater. At 10,000, we see a 32x increase. By contrast, unique handles the larger input space with effectively no increase in runtime -
count   | uniques | algorithm | runtime (ms) | each (ms)
--------|---------|-----------|--------------|----------
100     | 98      | unique    | 9.76         | 0.098
1,000   | 946     | unique    | 4.11         | 0.004
10,000  | 6,340   | unique    | 13.10        | 0.001
100,000 | 10,000  | unique    | 121.55       | 0.001
With a size of randCoord(1000), we generate coordinates from (0,0) to (1000,1000), with a maximum possibility of 1,000 x 1,000 = 1,000,000 unique coordinates. This is similar to a { first, last } object where there could be 1,000 unique first names and 1,000 unique last names -
count   | uniques | algorithm | runtime (ms) | each (ms)
--------|---------|-----------|--------------|----------
100     | 100     | filter    | 0.39         | 0.004
1,000   | 999     | filter    | 38.86        | 0.039
10,000  | 9,950   | filter    | 3,710.00     | 0.371
100,000 | 95,186  | filter    | 364,549.00   | 3.645
The noteworthy pattern here is that as the number of uniques increases, it takes filter longer to find each result. With a large count of 100,000 elements to check, it takes over 6 minutes to complete. Meanwhile, unique computes the result with no increase in per-element runtime as the number of uniques increases -
count   | uniques | algorithm | runtime (ms) | each (ms)
--------|---------|-----------|--------------|----------
100     | 100     | unique    | 0.12         | 0.001
1,000   | 999     | unique    | 0.97         | 0.001
10,000  | 9,950   | unique    | 11.04        | 0.001
100,000 | 95,186  | unique    | 124.99       | 0.001
real-world example
Above we compared objects with just two fields. For a final benchmark, let's see how the two algorithms compare when dealing with an object with six (6) fields: first, last, age, phone, gender, and email -
// fixtures
const rand = x =>
Math.floor(Math.random() * x)
const randObj = () => ({
first: rand(1000),
last: rand(1000),
age: rand(100),
phone: rand(1e10),
gender: rand(2),
email: rand(1e20)
})
const input =
Array.from(Array(count), randObj)
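Multiplying the field ranges above gives the number of possible distinct objects; a quick sanity check with BigInt (illustrative only, not part of the benchmark):

```javascript
// 1000 * 1000 * 100 * 1e10 * 2 * 1e20 possible unique objects
const possible =
  1000n * 1000n * 100n * 10_000_000_000n * 2n * 100_000_000_000_000_000_000n
console.log(possible === 2n * 10n ** 38n) // true, i.e. 2e38
```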
This is equal to 1000 x 1000 x 100 x 10,000,000,000 x 2 x 100,000,000,000,000,000,000 = 200,000,000,000,000,000,000,000,000,000,000,000,000, or 2e38, possible uniques -
count     | uniques   | algorithm | runtime (ms) | each (ms)
----------|-----------|-----------|--------------|----------
100       | 100       | filter    | 1.95         | 0.019
1,000     | 1,000     | filter    | 53.47        | 0.053
10,000    | 10,000    | filter    | 4,435.00     | 0.444
100,000   | 100,000   | filter    | 471,622.00   | 4.716
1,000,000 | 1,000,000 | filter    | ?            | ?
Unsurprisingly, not a single duplicate was found, even among 100,000 objects. This demonstrates the worst-case scenario for both algorithms. I threw in a large benchmark of 1,000,000 elements just to stress-test the unique algorithm. While it would take filter over 12 hours to complete, unique determines the result in less than 5 seconds -
count     | uniques   | algorithm | runtime (ms) | each (ms)
----------|-----------|-----------|--------------|----------
100       | 100       | unique    | 11.49        | 0.115
1,000     | 1,000     | unique    | 6.55         | 0.007
10,000    | 10,000    | unique    | 47.77        | 0.005
100,000   | 100,000   | unique    | 404.93       | 0.004
1,000,000 | 1,000,000 | unique    | 4,956.00     | 0.005
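The "over 12 hours" figure for filter is not a measurement but a rough quadratic extrapolation from the 100,000-element run (an estimate on my part; real constants vary):

```javascript
// filter took ~471,622 ms for 100,000 fully-unique elements;
// for an O(n^2) scan, 10x the input means roughly 100x the runtime
const projectedMs = 471_622 * 100
console.log((projectedMs / 3_600_000).toFixed(1) + " hours") // "13.1 hours"
```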
enhancement 1
A big advantage of writing our own data structures is that we can change their behaviour and tune their performance as needed. If we add an object and then attempt a lookup using a subset of that object, we will get a true answer -
import { empty, add, has } from "./unique.js"
const t = empty()
add(t, {a: 1, b: 2})
console.log(has(t, {a: 1})) // true ❌
If the underlying representation of t is -
Map { "a" => Map { 1 => Map { "b" => Map { 2 => ... }}}}
That means has(t, {a: 1}) will find the nested map below, and return true -
{ "b" => Map { 2 => ... }}
You might think this is a problem with unique, but the issue exists in recmap as well. This is a bug, and we can fix it by introducing a simple checksum -
// recmap.js (continued)
const checksum = key =>
[ key.length, ...key ] // <- prepend sequence with length
const empty = // ...
const _get = // ... originally `get`, now private
const _has = // ... originally `has`, now private
const _set = // ... originally `set`, now private
// new wrappers: checksum the key and call private function
const get = (t, k) => _get(t, checksum(k))
const has = (t, k) => _has(t, checksum(k))
const set = (t, k, v) => _set(t, checksum(k), v)
export // ...
No changes are necessary for unique. We're back in business -
add(t, {a: 1, b: 2})
console.log(has(t, {a: 1})) // false ✅
console.log(has(t, {a: 1, b: 2})) // true ✅
console.log(has(t, {b: 2, a: 1})) // true ✅
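To see why the length prefix fixes the bug, compare the checksummed key paths; a shorter path can no longer be a prefix of a longer one (the helper is copied from the module above):

```javascript
const checksum = key => [ key.length, ...key ] // <- prepend sequence with length

console.log(checksum(["a", 1]))         // [ 2, "a", 1 ]
console.log(checksum(["a", 1, "b", 2])) // [ 4, "a", 1, "b", 2 ]
// the first Map level now branches on 2 vs 4, so the lookups for
// { a: 1 } and { a: 1, b: 2 } can never collide on a shared prefix
```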
enhancement 2
Our recmap module uses Map as the underlying representation, but the user doesn't (and shouldn't) know that. What happens if the user attempts to insert a Map of their own into the mix? By introducing a wrap utility, we can allow the user to insert any value type and prevent any implementation details from leaking -
// recmap.js (continued)
const checksum = // ...
const empty = // ...
const _get = // ...
const _has = // ...
const _set = // ...
const wrap = v => // <- wrap utility
({ unwrap: v })
const has = // ...
const get = (t, k) => _get(t, checksum(k))?.unwrap // <- unwrap; ?. guards missing keys
const set = (t, k, v) => _set(t, checksum(k), wrap(v)) // <- wrap
export // ...
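To see what wrap buys us, consider a user storing a Map as a value; without wrapping, the instanceof Map test in values would mistake it for another nested level (sketch, using the same helper as above):

```javascript
const wrap = v => ({ unwrap: v })     // <- same wrap utility as above
const userValue = new Map([["x", 1]]) // a user-supplied Map as a value
const cell = wrap(userValue)
console.log(cell instanceof Map)       // false <- no longer looks like a nested level
console.log(cell.unwrap === userValue) // true  <- original value is preserved
```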
enhancement 3
We designed recmap around the native Map, which offers a few other useful methods. We can add entries and keys to offer a more complete implementation -
// recmap.js (continued)
// ...
function* entries (t, path = [])
{ if (t instanceof Map)
for (const [k, v] of t.entries())
yield *entries(v, [...path, k])
else
yield [path.slice(1), t.unwrap ]
}
function* keys (t, path = [])
{ if (t instanceof Map)
for (const [k, v] of t.entries())
yield *keys(v, [...path, k])
else
yield path.slice(1)
}
export { ..., entries, keys } // <- update export
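A quick check of entries on a hand-built representation (using the checksummed, wrapped form from the enhancements above: the key path ["a", 1] is stored under its length, 2, with the leaf wrapped in { unwrap }):

```javascript
function* entries (t, path = [])
{ if (t instanceof Map)
    for (const [k, v] of t.entries())
      yield *entries(v, [...path, k])
  else
    yield [path.slice(1), t.unwrap] // <- slice(1) drops the checksum length
}

// set(empty(), ["a", 1], "hello") is represented as:
const t = new Map([[2, new Map([["a", new Map([[1, { unwrap: "hello" }]])]])]])
console.log([...entries(t)]) // [ [ ["a", 1], "hello" ] ]
```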
demo
An answer without a functioning demo is incomplete, imo. I've inlined the recmap and unique modules because we are limited to just one "file" for demonstration. We will borrow the input examples from Brandon's post -
const source1 =
[ {age: 23, name: 'john'}
, {age: 24, name: 'jane'}
, {age: 25, name: 'sam'}
, {age: 26, name: 'smith'}
]
const source2 =
[ {age: 23, name: 'john'}
, {age: 24, name: 'jane'}
, {age: 25, name: 'sam'}
, {age: 26, name: 'smith'}
, {age: 27, name: 'mike'}
, {age: 28, name: 'joanne'}
, {age: 29, name: 'lois'}
, {age: 30, name: 'paul'}
]
const recmap = (_ => { // recmap module
const empty = () => new Map
const checksum = key => [ key.length, ...key ]
const wrap = v => ({ unwrap: v })
const _has = (t, [ k, ...ks ]) => ks.length == 0 ? t.has(k) : t.has(k) ? _has(t.get(k), ks) : false
const _set = (t, [ k, ...ks ], v) => ks.length == 0 ? t.set(k, v) : t.has(k) ? (_set(t.get(k), ks, v), t) : t.set(k, _set(empty(), ks, v))
const _get = (t, [ k, ...ks ]) => ks.length == 0 ? t.get(k) : t.has(k) ? _get(t.get(k), ks) : undefined
const has = (t, keys) => _has(t, checksum(keys))
const set = (t, keys, v) => _set(t, checksum(keys), wrap(v))
const get = (t, keys) => _get(t, checksum(keys))?.unwrap
function* values (t) { if (t instanceof Map) for (const v of t.values()) yield *values(v); else yield t.unwrap }
return { empty, get, has, set, values }
})()
const unique = (_ => { // unique module
const toKey = v => Object.entries(v).sort((a,b) => a[0].localeCompare(b[0])).flat()
const add = (t, v) => recmap.set(t, toKey(v), v)
const has = (t, v) => recmap.has(t, toKey(v))
const empty = recmap.empty
const values = recmap.values
return { add, empty, has, values }
})()
const result = unique.empty()
for (const x of [...source1, ...source2])
unique.add(result, x)
console.log(Array.from(unique.values(result)))
All duplicates are removed -
[ { age: 23, name: 'john' }
, { age: 24, name: 'jane' }
, { age: 25, name: 'sam' }
, { age: 26, name: 'smith' }
, { age: 27, name: 'mike' }
, { age: 28, name: 'joanne' }
, { age: 29, name: 'lois' }
, { age: 30, name: 'paul' }
]