
Hi, I want to get the min and max value of a field in my db.

I found this solution, which queries and sorts the results: get max value in mongoose. I could do this twice and combine it with async.parallel to make it non-blocking. But I guess two db queries may not be the best solution.

The second solution would be to use aggregate, but I don't want to group anything. I only want to use $match to filter (the filter criteria always differ and can be {}) and run the query over all documents in my collection.

http://docs.mongodb.org/manual/reference/operator/aggregation/min/ http://docs.mongodb.org/manual/reference/operator/aggregation/max/

Questions:

  1. Can I run this in one query with aggregate, maybe with $project?
  2. Is there another method than aggregate that works without grouping?
  3. Will 1)/2) be more time-efficient than the first solution with sorting?

EDIT: I solved it with the first solution, but I think there is a more efficient one because this needs two database operations:

    async.parallel
      min: (next) ->
        ImplantModel.findOne(newFilter).sort("serialNr").exec((err, result) ->
          return next err if err?
          return next null, 0 if !result?
          next(null, result.serialNr)
        )
      max: (next) ->
        ImplantModel.findOne(newFilter).sort("-serialNr").exec((err, result) ->
          return next err if err?
          return next null, 0 if !result?
          next(null, result.serialNr)
        )
    , (err, results) ->
      console.log results.min, ' ', results.max
      return callback(err) if err?
      return callback null, {min: results.min, max: results.max}
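
For reference, here is a sketch of the single aggregate query discussed in the comments below (and benchmarked in the answer): grouping on `_id: null` covers the whole matched set, and `newFilter` (assumed to be the same, possibly empty, filter criteria as above) goes into a `$match` stage. Plain JavaScript here rather than CoffeeScript, and not tested against my schema:

    // Sketch only (untested): one aggregate call instead of two sorted findOne queries.
    // Assumes ImplantModel, newFilter and callback are the same as in the snippet above.
    ImplantModel.aggregate([
      { "$match": newFilter },                 // filter criteria, may be {}
      { "$group": {
        "_id": null,                           // null key => group the whole matched set
        "min": { "$min": "$serialNr" },
        "max": { "$max": "$serialNr" }
      }}
    ], function(err, result) {
      if (err) return callback(err);
      if (!result.length) return callback(null, { min: 0, max: 0 });
      callback(null, { min: result[0].min, max: result[0].max });
    });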
Andi Giga
  • Can you run what query? You say you found something and link to it but you don't say what **you** want to do. Your operation is what is important. Show what **your** task is to achieve. Not sure why you seem scared of aggregate, and `$project` is an aggregation pipeline stage. So you are probably not thinking with all the correct information. Also, do not title or ask for the "Best Solution". Show your problem and you will get an answer. – Blakes Seven Aug 07 '15 at 09:36
  • Hi, I want to get the minimum and the maximum value of a field in my collection. The aggregate docs describe exactly what I want to achieve, but they explicitly say you can only use it in the group stage. And I don't need to group anything: `$max is an accumulator operator available only in the $group stage`. So I didn't try it. I posted some code to show how I solved it with the first solution, but I think it is not a good solution. – Andi Giga Aug 07 '15 at 09:55
  • 1
    Really not sure what the question is here. *"I don't need to group anything"*. Yes you do, you want the "min" and "max" values and that is a "grouping" operation as is clearly defined. Can you possibly do with parallel queries? Sure you can, but it still is "two" queries that you could do in one. Benchmarking is important here, and I'd be pretty sure the aggregation operation wins. You group by `null` BTW to do the whole collection. – Blakes Seven Aug 07 '15 at 10:18
  • Ah ok, so I can use `_id: null` instead of `_id: "$item"` as in the mongoose example. I thought you needed to define an id? – Andi Giga Aug 07 '15 at 10:29
  • The `_id` field in `$group` is the "key" to group on. If it's not a document value (i.e. null) then the whole collection will be grouped. It seemed pretty clear from the outset there was a misunderstanding here. I suggest benchmarking. $10 for me on the aggregation statement with `null`. – Blakes Seven Aug 07 '15 at 10:31
  • Thank you, now it is clear to me too. I had the misunderstanding that the group key needs to be a field. I just have to find out how to benchmark now ;-) I will check if I get it working with aggregate later and post the compared results. – Andi Giga Aug 07 '15 at 10:43
  • No need for your benchmark as I supplied my own. Pretty shocking really. My partner will have a "field day with this". In fact, I told him and he is not happy. The algorithms clearly need to change. – Blakes Seven Aug 07 '15 at 17:03
  • I only wish your question "body" had been more "succinct" about what is being asked here, *"The difference between parallel sorts and aggregation"*, as this would be a more **canonical** resource for people to search for when they are looking for this kind of optimization. Still, "upvote" for the topic once it was made clear. – Blakes Seven Aug 07 '15 at 17:06

1 Answer


Don't know what it is about this question, and I'm sure to get no real love for the response, but I just could not let it go and get to sleep without resolving it.

So the first thing to say is I think I owe the OP here $10, because my expected results are not the case.

The basic idea presented here is a comparison of:

  • Using parallel execution of queries to find the "maximum" ( the top value by sort order ) of a field, and the minimum value in the same way

  • Using the aggregation framework `$max` and `$min` grouping accumulators over the whole collection.

In "theory" these two options are doing exactly the same thing. And in "theory" even though parallel execution can happen "over the wire" with simultaneous requests to the server, there still should be an "overhead" inherrent in those requests and the "aggregation" function in the client to bring both results together.

The tests here first run a "series" execution that creates random data of a reasonable key length; to keep the comparison "fair", the "key" data here is also indexed.

The next "fairness" stage is to "warm up" the data, by doing a sequential "fetch" on all items, to simulate loading as much of the "working set" of data into memory as the client machine is capable.

Then we run each test in series, so the "parallel query" case and the "aggregation" case do not compete against each other for resources, with timers attached to the start and end of each execution.

Here is my testbed script, on the basic driver to keep things as lean as possible ( nodejs environment considered ):

var async = require('async'),
    mongodb = require('mongodb'),
    MongoClient = mongodb.MongoClient;

var total = 1000000;

MongoClient.connect('mongodb://localhost/bigjunk',function(err,db) {
  if (err) throw err;

  var a = 10000000000000000000000;

  db.collection('bigjunk',function(err,coll) {
    if (err) throw err;

    async.series(
      [
        // Clean data
        function(callback) {
          console.log("removing");
          coll.remove({},callback);
        },

        // Insert data
        function(callback) {
          var count = 0,
              bulk = coll.initializeUnorderedBulkOp();

          async.whilst(
            function() { return count < total },
            function(callback) {
              var randVal = Math.floor(Math.random()*a).toString(16);
              //console.log(randVal);
              bulk.insert({ "rand": randVal });
              count++;

              if ( count % 1000 == 0 ) {
                if ( count % 10000 == 0 ) {
                  console.log("counter: %s",count);     // log 10000
                }

                bulk.execute(function(err,res) {
                  bulk = coll.initializeUnorderedBulkOp();
                  callback();
                });
              } else {
                callback();
              }

            },
            callback
          );
        },

        // index the collection
        function(callback) {
          console.log("indexing");
          coll.createIndex({ "rand": 1 },callback);
        },

        // Warm up
        function(callback) {
          console.log("warming");
          var cursor = coll.find();

          cursor.on("error",function(err) {
            callback(err);
          });

          cursor.on("data",function(data) {
            // nuthin
          });

          cursor.on("end",function() {
            callback();
          });
        },

        /*
         * *** The tests **
         */

        // Parallel test
        function(callback) {
          console.log("parallel");
          console.log(Date.now());
          async.map(
            [1,-1],
            function(order,callback) {
                coll.findOne({},{ "sort": { "rand": order } },callback);
            },
            function(err,result) {
              console.log(Date.now());
              if (err) return callback(err);
              console.log(result);
              callback();
            }
          );
        },

        function(callback) {
          console.log(Date.now());
          coll.aggregate(
            [{ "$group": {
              "_id": null,
              "min": { "$min": "$rand" },
              "max": { "$max": "$rand" }
            }}],
            function(err,result) {
              console.log(Date.now());
              if (err) return callback(err);
              console.log(result);
              callback();
            }
          );
        }
      ],
      function(err) {
        if (err) throw err;
        db.close();
      }
    );
  });
});

And the results ( compared to what I expected ) are appalling in the "aggregate" case.

For 10,000 documents:

1438964189731
1438964189737
[ { _id: 55c4d9dc57c520412399bde4, rand: '1000bf6bda089c00000' },
  { _id: 55c4d9dd57c520412399c731, rand: 'fff95e4662e6600000' } ]
1438964189741
1438964189773
[ { _id: null,
    min: '1000bf6bda089c00000',
    max: 'fff95e4662e6600000' } ]

Which indicates a difference of 6 ms for the parallel case, and a much larger difference of 32 ms for the aggregation case.

Can this get better? No:

For 100,000 documents:

1438965011402
1438965011407
[ { _id: 55c4dd036902125223a05958, rand: '10003bab87750d00000' },
  { _id: 55c4dd066902125223a0a84a, rand: 'fffe9714df72980000' } ]
1438965011411
1438965011640
[ { _id: null,
    min: '10003bab87750d00000',
    max: 'fffe9714df72980000' } ]

And the results still clearly show 5 ms for the parallel case, close to the result with 10 times less data, while the aggregation case comes in at 229 ms, nearly a factor of 10 ( matching the increase in data ) slower than the previous sample.

But wait, because it gets worse. Let's increase the sample to 1,000,000 entries:

1,000,000 document sample:

1438965648937
1438965648942
[ { _id: 55c4df7729cce9612303e39c, rand: '1000038ace6af800000' },
  { _id: 55c4df1029cce96123fa2195, rand: 'fffff7b34aa7300000' } ]
1438965648946
1438965651306
[ { _id: null,
    min: '1000038ace6af800000',
    max: 'fffff7b34aa7300000' } ]

This is actually the worst, because whilst the "parallel" case still continues to exhibit a 5 ms response time, the "aggregation" case now blows out to a whopping 2360 ms (wow, over 2 whole seconds), which has to be considered totally unacceptable as a differential from the alternate approach's time. That is around 500 times the execution time, and in computing terms that is huge.

Conclusions

Never make a bet on something unless you know a sure winner.

Aggregation "should" win here as the principles behind the results are basically the same as the "parallel excecution case" in the basic algorithm to pick the results from the keys of the index which is available.

This is a "fail" ( as my kids are fond of saying ) where the aggregation pipeline needs to be tought by someone ( my "semi-partner" is good at these things ) to go back to "algorithm school" and re-learn the basics that are being used by it's poorer cousin to producemuch faster results.

So the basic lesson here is:

  • We think the "aggregate" accumulators should be optimized to do this, but at present they clearly are not.

  • If you want the fastest way to determine min/max on a collection of data ( without any distinct keys ), then parallel query execution using the .sort() modifier is actually much faster than any alternative ( with an index ).

So for people wanting to do this over a collection of data, use a parallel query as shown here. It's much faster ( until we can teach operators to be better :> )
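
Applied back to the original use case ( an arbitrary, possibly empty filter ), a minimal sketch of that parallel approach with the basic driver might look like the following; `coll` and `filter` are assumed to exist here, and the sorted field should be indexed:

// Sketch: the recommended parallel findOne + sort approach, with an arbitrary filter.
// Assumes `coll` is a driver collection and `filter` is the (possibly empty) query object.
async.map(
  [1, -1],                                          // ascending => min, descending => max
  function(order, callback) {
    coll.findOne(filter, { "sort": { "rand": order } }, callback);
  },
  function(err, results) {
    if (err) return console.error(err);
    // async.map keeps input order, so results[0] is the min document, results[1] the max
    console.log({
      "min": results[0] && results[0].rand,
      "max": results[1] && results[1].rand
    });
  }
);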


I should note here that all timings are relative to hardware, and it is mainly the "comparison" of timings that is valid here.

These results are from my ( ancient ) laptop:

  • Core I7 CPU (8x cores)
  • Windows 7 Host ( yes could not be bothered to re-install )
  • 8GB RAM Host
  • 4GB Allocated VM ( 4x core allocation )
  • VirtualBox 4.3.28
  • Ubuntu 15.04
  • MongoDB 3.1.6 (Devel)

And the latest "stable" node versions for packages as required in the listing here.

Blakes Seven