5

I have two arrays with user id's and I want to check different items in them.

arr1 = [123, 456, 789];
arr2 = [123, 456, 789, 098];

The problem is: these arrays can have 10 or 20 millions of items.

I'm trying with underscore.difference() but it took 10 mins to complete.

Is there any faster way to do this?

user2864740
  • 60,010
  • 15
  • 145
  • 220
user3175226
  • 3,579
  • 7
  • 28
  • 47
  • Are the arrays sorted? – Cameron May 01 '14 at 18:20
  • 2
    if your elements work as object keys, it would probably be quicker to turn one array's elements into object keys with a value of 1, then filter your second array according to the object having or not having the elmement as a key. sorting will be slow with that many items, but object lookup is very fast and avoids the loop-de-loop that underscore's contains() method utilizes. – dandavis May 01 '14 at 18:22
  • 2
    Oh, it looks like `_.difference` doesn't even use a set .. so using a set/probe will be *much* better bounds. See http://stackoverflow.com/questions/13147278/using-underscores-difference-method-on-objects – user2864740 May 01 '14 at 18:23
  • 3
    If you're working with 10 million items then maybe you should put your data in a database. – mu is too short May 01 '14 at 18:49

4 Answers4

3

Use native js, rather than a library that tries to accommodate lots of scenarios / inputs.

  • Skip all the function calls and scope changes.
  • Minimize property lookup/namespace walking
  • Don't keep getting the array length on each loop

Simple optimization:

var array1 = [];
var array2 = [];

var difference = [];

for(var i = 0; len = array1.length; i < len; i++)
{
    var value = array1[i];

    if(value == array2[i])
    {
        continue;
    }

    if(array2.indexOf(value) == -1)
    {
        difference.push(value);
    }
}
Metalstorm
  • 2,940
  • 3
  • 26
  • 22
3

How about converting the arrays to object to reduce the sorting complexity:

var arr1 = [123, 456, 789], arr2 = [123, 456, 789, 098];

function toObject(arr){
    return arr.reduce(function(o, v, i) {
      o[v] = i;
      return o;
    }, {});
}

var o1 = toObject(arr1), o2 = toObject(arr2), diff = [];

for(var prop in o2){
    if(o1[prop] === undefined)
        diff.push(prop);
}

console.log(diff);

You would obviously need to start on the biggest set.

http://jsfiddle.net/sHUw5/

Another thing to consider is to sort your collections and do a binary search to reduce the complexity from (O)N to (O)log2N for each array (if I'm thinking straight).

Johan
  • 35,120
  • 54
  • 178
  • 293
  • 1
    There's no need to convert `arr2` unless you have reason to believe that property iteration is significantly faster than array traversing (it probably isn't). – Aaron Dufour May 01 '14 at 18:40
  • Now you're iterating twice, and a for-in loop is generally slower than a regular for loop ? – adeneo May 01 '14 at 18:41
  • Change the function call `hasOwnProperty` to `if(!o1[prop])`. Use `if(o1[prop] === undefined)` if the dataset contains 0, might actually be faster with `undefined` as it doesn't have to do any casting. – Metalstorm May 01 '14 at 18:44
  • the in operator will be faster than hasOwnProperty(), maybe even faster than comparing or casting... – dandavis May 01 '14 at 18:45
  • This answer also excludes items that are in `arr1` and not in `arr2`. – cookie monster May 01 '14 at 18:45
  • @adeneo I guess you're right about for..in vs a "normal" for. But regarding the double iteration, you would always need to spend some performance when converting a collection to a "set". In OP's case it's probably worth it considering the size. – Johan May 01 '14 at 18:49
  • @Johan I don't see where that jsperf is relevant to what I said. I'm saying that you want to `for(var i = 0;i < arr2.length;i++)` instead of the `for...in...`. That would also save you from the extra object conversion. – Aaron Dufour May 01 '14 at 19:10
  • And to add to that, `for (var i=arr.length; i--;)` is faster as it doesn't check the length or the iterator on each iteration. – adeneo May 01 '14 at 19:26
  • @AaronDufour Sorry, a missunderstanding from my side then¨ – Johan May 01 '14 at 19:53
  • @adeneo Don't over-optimize in javascript; the runtime will easily take care of things like that. For example, your version actually becomes slower than the more normal form for arrays longer than a couple hundred million elements. For smaller arrays, you'll see a gain of about 3% of the loop overhead, which will be dominated by the contents of the loop in pretty much all cases. – Aaron Dufour May 01 '14 at 21:31
  • @AaronDufour - Writing efficient loops is not over-optimizing, it's good practice, and where's the proof that a reversed loop is slower than a regular loop, they should be the same, the advantage of writing the reversed version is that it's short to write and caches everything automatically, so when order is not important it's a good way to write a loop. Show a link or jsperf that can successfully prove that iterating in reverse is slower. – adeneo May 01 '14 at 21:40
  • @adeneo I mean, I ran a couple of straightforward tests in node, since this is a question about node. Feel free to test it yourself as well. It is over-optimizing if you do it without realizing that its at best immeasurably faster, unfamiliar and thus more difficult to read, and traverses the array in reverse. – Aaron Dufour May 01 '14 at 21:45
  • @AaronDufour - It shouldn't really be unfamiliar, the way to iterate something as basic as a live nodeList is with a loop exactly like that, in reverse, as regular loops will cause issues when the live nodeList is changed, so it should be something someone with javascript experience has seen many many times before, and it's also regarded as the fastest way to loop, only beaten by a reversed while loop in some cases. – adeneo May 01 '14 at 21:49
2

This considers you do not have the numbers 0 or 1 in the arrays:

var arr1 = [123, 456, 789,3],
    arr2 = [123, 456, 789, 098],    
    has = {},
    different=[],
    length1=arr1.length,
    length2=arr2.length;



for(var i=0;i<length1;i++){
        has[arr1[i]]=true;
}
for(var i=0;i<length2;i++){
    var val=arr2[i];
    if(has[val] === undefined){
       has[val]=val;
    }
    else{
        if(has[val]!=val){
            has[val]=false;
        }
   }
}
for(var i in has){
    if (has[i]) different.push(i);
}

If you want to check also for 0 and 1:

for(var i=0;i<length1;i++){
    has[arr1[i]]=NaN;
}
for(var i=0;i<length2;i++){
    var val=arr2[i];
    if(has[val] === undefined){
        has[val]=null;
    }
    else{
        if(has[val]!=null){
            has[val]=true;
        }
    }
}
for(var i in has){
    if (!has[i]) different.push(i);
}
juvian
  • 15,875
  • 2
  • 37
  • 38
0

here is a fast way of cheating the nested iteration that _.difference will invoke:

var arr1 = [123, 456, 789],
    arr2 = [123, 456, 789, 098],    
     has = {};

arr1.forEach(function(a){ this[a]=1;}, has);
alert(  arr2.filter(function(a){return !this[a]; }, has)   );

by using this in the iteration, we hand a pure function to JS that can be executed at the maximum possible speed.

note that this won't work for arrays of objects, or mixed-type arrays like [1, "1"], but it should work for the problem described and demonstrated by the question.

edit: it you want bi-directional compares (like arr1 having, arr2 missing or vice-versa), reverse and repeat the code above. you'll still only be at 40million computations compared to 100trillion that an indexOf()-using method will cost...

dandavis
  • 16,370
  • 5
  • 40
  • 36
  • 1
    It will be a lot faster if you write the loops by hand – Esailija May 01 '14 at 18:28
  • You're going to exclude items in `arr1` that are not in `arr2`. – cookie monster May 01 '14 at 18:30
  • i think switching the n*n comparisions for n*2 comparisions will save far more time than un-function-ing the loop, but i'll concede the point. I was trying to keep it within the same spirit as underscore. you can adjust the logic to find diffs, repeats, uniques, etc... the main point is replacing [].indexOf(n) for {}[n], ahuge perf win. – dandavis May 01 '14 at 18:30
  • `this` is going to be `global` in that case, which makes it definitely not a pure function, plus you're polluting the global namespace. – Aaron Dufour May 01 '14 at 18:39
  • @AaronDufour: He's setting the `this` value to the `has` object. – cookie monster May 01 '14 at 18:40
  • Oh, I've never seen a second argument to `forEach`. Interesting. – Aaron Dufour May 01 '14 at 18:42
  • That's such a horrific abuse of `this` that I can't say I've ever seen it before. On-topic, making it a pure function doesn't actually make things any faster, so I'm not sure why you're so adamant about that. The non-pure version runs about 3% (i.e. insignificantly) faster on my machine (node 0.10.24). – Aaron Dufour May 01 '14 at 18:49
  • 1
    you're supposed to use "this" in functions, what makes it "horrific abuse"? i've always thought it was the best part of the [] methods and lamented that reduce doesn't allow the same option, although, it's not as important for reduction as it is for filtering. – dandavis May 01 '14 at 18:56
  • @AaronDufour: There's nothing horrific about it, but it's perhaps a little bit odd in this context since the variable is right there. It's very useful when you have a set of reusable functions that won't have access to the desired object, like: `arr1.forEach(item_to_this_object_key, has)` – cookie monster May 01 '14 at 19:05
  • My point is that should be a second argument. Using `this` as a general-purpose argument completely undermines the semantics it is supposed to imply. – Aaron Dufour May 01 '14 at 19:13
  • @AaronDufour: if only we could define our own list of run-time arguments for such functions, but map/filter/forEach/etc all apply(caller.arguments[1], [element,index,collection]) to the iteration function. you can use bind to curry, but that makes your re-usable functions less readable... in close, consider function gt(n){return n>this;}, and how it can compare to any number as needed, instead of hard-coding a threshold. ex: r.filter(gt,5), or r2.filter(gt, 10)... Or how function pluck(a){return a[this];} can pluck any property from an array of objects, not just one. – dandavis May 01 '14 at 19:45