Don't know what it is about this question, and I'm sure this response will get no real love, but I just could not let it go and get to sleep without resolving it.
So the first thing to say is I think I owe the OP here $10, because my expected results turned out not to be the case.
The basic idea presented here is a comparison of:

- Using parallel execution of queries to find the "maximum" ( the top value in sort order ) of a field and also the minimum value by the same constraint
- Using the aggregation framework $max and $min grouping accumulators over the whole collection.
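In a nutshell, the two things being timed look like this. This is just a condensed view of the two test functions in the full listing further down, assuming an open collection object coll and the async library exactly as used there:

    // Parallel query approach: two sorted findOne() calls issued together,
    // ascending for the minimum and descending for the maximum.
    async.map(
      [1,-1],
      function(order,callback) {
        coll.findOne({},{ "sort": { "rand": order } },callback);
      },
      function(err,results) {
        // results[0] is the document with the lowest "rand", results[1] the highest
      }
    );

    // Aggregation approach: a single $group over the whole collection using
    // the $min and $max accumulators.
    coll.aggregate(
      { "$group": {
        "_id": null,
        "min": { "$min": "$rand" },
        "max": { "$max": "$rand" }
      }},
      function(err,result) {
        // result[0] holds both the min and max values in one document
      }
    );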
In "theory" these two options are doing exactly the same thing. And in "theory" even though parallel execution can happen "over the wire" with simultaneous requests to the server, there still should be an "overhead" inherrent in those requests and the "aggregation" function in the client to bring both results together.
The tests here run a "series" execution that creates random data of a reasonable key length; then, to be "fair" in the comparison, the "key" data here is also indexed.
The next "fairness" stage is to "warm up" the data by doing a sequential "fetch" of all items, to simulate loading as much of the "working set" of data into memory as the client machine is capable of holding.
Then we run each test in comparison, and in series so they do not compete against each other for resources, for either the "parallel query" case or the "aggregation" case, with timers attached to the start and end of each execution.
Here is my testbed script, written against the basic driver to keep things as lean as possible ( nodejs environment considered ):
var async = require('async'),
    mongodb = require('mongodb'),
    MongoClient = mongodb.MongoClient;

var total = 1000000;

MongoClient.connect('mongodb://localhost/bigjunk',function(err,db) {
  if (err) throw err;

  var a = 10000000000000000000000;

  db.collection('bigjunk',function(err,coll) {
    if (err) throw err;

    async.series(
      [
        // Clean data
        function(callback) {
          console.log("removing");
          coll.remove({},callback);
        },

        // Insert data
        function(callback) {
          var count = 0,
              bulk = coll.initializeUnorderedBulkOp();

          async.whilst(
            function() { return count < total },
            function(callback) {
              var randVal = Math.floor(Math.random(a)*a).toString(16);
              //console.log(randVal);
              bulk.insert({ "rand": randVal });
              count++;
              if ( count % 1000 == 0 ) {
                if ( count % 10000 == 0 ) {
                  console.log("counter: %s",count); // log every 10000
                }
                bulk.execute(function(err,res) {
                  bulk = coll.initializeUnorderedBulkOp();
                  callback();
                });
              } else {
                callback();
              }
            },
            callback
          );
        },

        // Index the collection
        function(callback) {
          console.log("indexing");
          coll.createIndex({ "rand": 1 },callback);
        },

        // Warm up
        function(callback) {
          console.log("warming");
          var cursor = coll.find();

          cursor.on("error",function(err) {
            callback(err);
          });

          cursor.on("data",function(data) {
            // nuthin
          });

          cursor.on("end",function() {
            callback();
          });
        },

        /*
         * *** The tests ***
         */

        // Parallel test
        function(callback) {
          console.log("parallel");
          console.log(Date.now());
          async.map(
            [1,-1],
            function(order,callback) {
              coll.findOne({},{ "sort": { "rand": order } },callback);
            },
            function(err,result) {
              console.log(Date.now());
              if (err) return callback(err);
              console.log(result);
              callback();
            }
          );
        },

        // Aggregation test
        function(callback) {
          console.log(Date.now());
          coll.aggregate(
            { "$group": {
              "_id": null,
              "min": { "$min": "$rand" },
              "max": { "$max": "$rand" }
            }},
            function(err,result) {
              console.log(Date.now());
              if (err) return callback(err);
              console.log(result);
              callback();
            }
          );
        }
      ],
      function(err) {
        if (err) throw err;
        db.close();
      }
    );
  });
});
And the results ( compared to what I expected ) are appalling for the "aggregate" case.
For 10,000 documents:
1438964189731
1438964189737
[ { _id: 55c4d9dc57c520412399bde4, rand: '1000bf6bda089c00000' },
{ _id: 55c4d9dd57c520412399c731, rand: 'fff95e4662e6600000' } ]
1438964189741
1438964189773
[ { _id: null,
min: '1000bf6bda089c00000',
max: 'fff95e4662e6600000' } ]
Which indicates a difference of 6 ms for the parallel case, and a much larger 32 ms for the aggregation case.
Can this get better? No:
For 100,000 documents:
1438965011402
1438965011407
[ { _id: 55c4dd036902125223a05958, rand: '10003bab87750d00000' },
{ _id: 55c4dd066902125223a0a84a, rand: 'fffe9714df72980000' } ]
1438965011411
1438965011640
[ { _id: null,
min: '10003bab87750d00000',
max: 'fffe9714df72980000' } ]
And the results still clearly show 5 ms for the parallel case, which is close to the result with 10 times less data, while the aggregation case now takes 229 ms, nearly a factor of 10 ( the increase in the data ) slower than the previous sample.
But wait, because it gets worse. Let's increase the sample to 1,000,000 entries:
1,000,000 document sample:
1438965648937
1438965648942
[ { _id: 55c4df7729cce9612303e39c, rand: '1000038ace6af800000' },
{ _id: 55c4df1029cce96123fa2195, rand: 'fffff7b34aa7300000' } ]
1438965648946
1438965651306
[ { _id: null,
min: '1000038ace6af800000',
max: 'fffff7b34aa7300000' } ]
This is actually the worst, because whilst the "parallel" case still continues to exhibit a 5 ms response time, the "aggregation" case now blows out to a whopping 2360 ms (wow, over 2 whole seconds). That can only be considered totally unacceptable as a differential from the alternative approach. That is roughly 500 times the execution time ( 2360 ms against 5 ms ), and in computing terms that is huge.
Conclusions
Never make a bet on something unless you know a sure winner.
Aggregation "should" win here as the principles behind the results are basically the same as the "parallel excecution case" in the basic algorithm to pick the results from the keys of the index which is available.
This is a "fail" ( as my kids are fond of saying ) where the aggregation pipeline needs to be tought by someone ( my "semi-partner" is good at these things ) to go back to "algorithm school" and re-learn the basics that are being used by it's poorer cousin to producemuch faster results.
So the basic lesson here is:

- We think the "aggregate" accumulators should be optimized to do this, but at present they clearly are not.
- If you want the fastest way to determine min/max on a collection of data ( without distinct keys ), then parallel query execution using the .sort() modifier is actually much faster than any alternative ( with an index in place ).
So for people wanting to do this over a collection of data, use a parallel query as shown here. It's much faster ( until we can teach operators to be better :> )
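If you want to fold that pattern into something reusable, here is a minimal sketch of a helper in the same callback style as the testbed above. The name minMax and the field argument are mine, not anything from the driver, and it assumes the async library and a collection object just like the listing does:

    // Hypothetical helper: run the min and max queries in parallel over an
    // indexed field and hand back a single { min: ..., max: ... } object.
    function minMax(coll, field, callback) {
      async.map(
        [1,-1],                       // ascending for min, descending for max
        function(order,callback) {
          var sort = {};
          sort[field] = order;
          coll.findOne({},{ "sort": sort },callback);
        },
        function(err,docs) {
          if (err) return callback(err);
          callback(null,{ "min": docs[0][field], "max": docs[1][field] });
        }
      );
    }

    // Usage, in the same style as the testbed:
    minMax(coll, "rand", function(err,result) {
      if (err) throw err;
      console.log(result);            // e.g. { min: '1000...', max: 'ffff...' }
    });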
I should note here that all timings are relative to hardware, and it is mainly the "comparison" of timings that is valid here.
These results are from my ( ancient ) laptop:
- Core i7 CPU (8x cores)
- Windows 7 Host ( yes could not be bothered to re-install )
- 8GB RAM Host
- 4GB Allocated VM ( 4x core allocation )
- VirtualBox 4.3.28
- Ubuntu 15.04
- MongoDB 3.1.6 (Devel)
And the latest "stable" node versions for packages as required in the listing here.