
I have billions of items in data set A and billions of items in data set B.

If an item of A is in B, mark the item "red"; if not, mark it "blue".

The only function I can think of is very slow, like this:

var A = [10000000],
    B = [1000000];
for (var m = 0; m < A.length; m++) {
  var isInB = false;
  for (var n = 0; n < B.length; n++) {
    if (B[n].id == A[m].id) {
      isInB = true;
      break;
    }
  }
  A[m].color = isInB ? "red" : "blue";
}
  • Are you running this in the browser? reason you are not processing this on the server? – epascarello Jul 15 '16 at 12:10
  • is the data sorted? – Nina Scholz Jul 15 '16 at 12:13
  • yes, it's in the browser, and the data may not be so large. – user5088152 Jul 15 '16 at 12:14
  • not sorted, but if needed, I can sort it beforehand. – user5088152 Jul 15 '16 at 12:19
  • `A = [10000000]` creates an array with one element, perhaps not what you want. Also, does "item of A in B" mean that the same object(?) is stored both in A and B? Or a different object with the same id? – le_m Jul 15 '16 at 12:21
  • If the smaller of A and B is small enough to fit inside a hashtable in memory, then probably the fastest way is to load that smallest dataset into a hashtable and just loop through each item in the other one, testing whether it exists in the hashtable (the value attached to the key doesn't matter). If both are too big to fit in memory, sort each of them (as @NinaScholz alluded to) and then you can merge the two sorted lists in linear time to find all duplicates. (Google that for more info.) – j_random_hacker Jul 15 '16 at 12:24
  • To "mark" objects, you could use a `WeakMap`, e.g. `var isInB = new WeakMap()`. A weak map only holds 'weak' references to the stored objects. This way, when you delete the object itself (you are no longer referencing the object from anywhere), the garbage collector can safely remove it, ignoring the weak reference from your weak map. – le_m Jul 15 '16 at 12:26
  • Thanks all of you, I will have a try. – user5088152 Jul 15 '16 at 12:38
  • I turned one of the arrays into a map; it is almost 100 times faster, from 4000 ms to 40 ms. – user5088152 Jul 15 '16 at 12:47
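The sort-and-merge alternative suggested in the comments can be sketched like this (illustrative only; it assumes items are objects with a numeric `id`, and the function name and sample data are made up for the example):

```javascript
// Sketch of the sort-and-merge approach: sort both arrays by id,
// then walk them in lockstep to find matches in linear time.
function markBySortedMerge(A, B) {
  // Sort copies by id: O(n log n + m log m); the copies share the
  // same object references, so marking a[i] marks the item in A.
  var a = A.slice().sort((x, y) => x.id - y.id);
  var b = B.slice().sort((x, y) => x.id - y.id);

  var j = 0;
  for (var i = 0; i < a.length; i++) {
    // Advance the pointer into b until b[j].id >= a[i].id
    while (j < b.length && b[j].id < a[i].id) j++;
    a[i].color = (j < b.length && b[j].id === a[i].id) ? "red" : "blue";
  }
  return A;
}

var A = [{id: 3}, {id: 1}, {id: 8}];
var B = [{id: 8}, {id: 2}, {id: 3}];
markBySortedMerge(A, B);
// items with ids 3 and 8 become "red", id 1 becomes "blue"
```

This avoids keeping a separate hash structure in memory, at the cost of the two sorts up front.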

2 Answers


You could use a temporary set and then perform the test against that. Here is an ES6 implementation:

// sample data: primes (A) and Fibonacci numbers (B)
var A = [{id: 1}, {id: 2}, {id: 3}, {id: 5}, {id: 7}, {id: 11}, {id: 13}, {id: 17},
         {id: 19}, {id: 23}];
var B = [{id: 1}, {id: 2}, {id: 3}, {id: 5}, {id: 8}, {id: 13}, {id: 21}, {id: 34}];

// Create a set with all ID values that exist in B:
var bSet = new Set(B.map(b => b.id));
// Enrich A with color property based on that set:
A.forEach(a => a.color = bSet.has(a.id) ? 'red' : 'blue');

console.log(A);

As this is based on a set, there is no need to first sort the data.

Performance

In comparing algorithms I will ignore the time spent on creating the color property, as both algorithms have to do that for all elements of A.

The original algorithm has a time complexity of O(n·m), where n and m are the number of elements in A and B respectively.

Using a set gives a performance increase compared to the original algorithm. Many JavaScript engines implement sets with near-constant insertion and lookup times (with hashes, see for example V8), although it could be O(log n) if a standard search tree is used, n being the number of elements in the set. I'll take the worst case and assume O(log n) for both operations.

The above algorithm creates the set in O(m·log m) time, and then populates A with the extra attribute in O(n·log m) time.

That makes the total time complexity O((n+m)·log m), which is better than O(n·m). If constant insertion and lookup times apply, this reduces to a simple O(n+m) time complexity.
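A rough way to see the difference in practice is a small timing sketch (the `makeData` helper and the sizes are illustrative, far below "billions"; absolute timings depend on the engine):

```javascript
// Rough benchmark sketch comparing the nested loop with the Set approach.
// Sizes are kept small enough to run quickly; the gap grows with n and m.
function makeData(n) {
  var out = [];
  for (var i = 0; i < n; i++) out.push({id: Math.floor(Math.random() * n * 2)});
  return out;
}

var A = makeData(5000), B = makeData(5000);

console.time('nested loop');       // O(n·m)
A.forEach(a => { a.slow = B.some(b => b.id === a.id) ? 'red' : 'blue'; });
console.timeEnd('nested loop');

console.time('set');               // O(n+m) with a hash-based Set
var bSet = new Set(B.map(b => b.id));
A.forEach(a => { a.color = bSet.has(a.id) ? 'red' : 'blue'; });
console.timeEnd('set');

// Both approaches agree on every element:
console.log(A.every(a => a.slow === a.color)); // true
```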

trincot
  • How does this address the performance aspect? Why and how is it faster? That should be explained in your answer. – Ingo Bürk Jul 15 '16 at 12:47
  • A `Set` lookup is in O(1) so looking up all IDs is in O(n), much faster than the original O(n²). If A and B both hold 'billions' of data, then the memory footprint is already way beyond 10 GB and another Set of a few GB shouldn't matter. So while being a bit short on explaining things, this answer is still correct. – le_m Jul 15 '16 at 13:09
  • I added a section on performance, being rather conservative on the performance of insertion/lookup in a `Set`. – trincot Jul 15 '16 at 13:10

I think @trincot's answer is correct, but I thought I would offer my own thought process for how I would solve it.

I think a good algorithm exists if we take a closer look at your problem statement:

if item of A in B ,mark the item "red" ,if not ,mark it "blue"

In pseudo code this becomes:

for (item of A) {
  if (B.contains(item)) {
    item.markRed();
  } else {
    item.markBlue();
  }
}

This has one loop instead of two nested loops, which means we're back in O(n) territory instead of the very bad O(n^2). The question then becomes: what data structure do we use for B so that there is a fast "contains" method? @trincot suggests a Set, but any map/dictionary implementation would also serve its purpose.

There's the extra cost of creating the set/map/dictionary, but it's much, much less than nested loops.
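For illustration, the same marking built on a `Map` keyed by id (the sample data is made up); a map has the small bonus that it also hands you the matching B item, not just a yes/no answer:

```javascript
// Build a Map from id to the item in B, then test membership via .has().
var A = [{id: 1}, {id: 4}, {id: 13}];
var B = [{id: 1}, {id: 8}, {id: 13}];

var bById = new Map(B.map(b => [b.id, b]));

A.forEach(a => {
  a.color = bById.has(a.id) ? 'red' : 'blue';
  // bById.get(a.id) would return the matching B item for 'red' entries
});

console.log(A.map(a => a.color)); // → ["red", "blue", "red"]
```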

So it's faster because we replaced the inner loop with a single constant-time lookup: for every item of A we do one operation instead of up to |B| operations.

@trincot's big-O analysis also looks pretty good if you want a more complete understanding of why it's much faster.

Chris Scott