I've read that pre-allocating a record can improve performance, which should be especially beneficial when handling many records in a time-series dataset.

var updateRefLog = function(_ref, year, month, day){
    var id = _ref + "|" + year + "|" + month;
    db.collection('ref_history').count({"_id":id},function(err,count){
        // pre-allocate if needed
        if(count < 1){
            db.collection('ref_history').insert({
                "_id":id
                ,"dates":[{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0}]
            });
        }

        // update: build the $inc path for the given day
        var inc = {};
        inc['dates.' + day + '.count'] = 1;
        var update = { "$inc": inc };
        db.collection('ref_history').update({"_id":id},update,{upsert: true},
            function(err, res){
                if(err !== null){
                    //handle error
                }
            }
        );
    });
};

I'm a little concerned that the extra query might slow this down, and that checking the count every time could negate the performance benefit of pre-allocating the record.

Is there a more performant way to handle this?

  • What version of MongoDB are you using, and if MongoDB 3.0 or newer, what storage engine? Preallocation is a useful optimisation technique for the MMAP storage engine, but will add overhead for other storage engines (e.g. WiredTiger) that do not support in-place updates. – Stennie Mar 05 '16 at 02:04

1 Answer

The general statement of "pre-allocation" is about the potential cost of an "update" operation that causes the document to "grow". If that results in a document size greater than the currently allocated space, then the document would be "moved" to another location on disk to accommodate the new space. This can be costly, and hence the general recommendation to initially write the document at its eventual "size".
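
To illustrate (a minimal sketch, not from the original answer, callbacks omitted for brevity, and only relevant to the MMAPv1 storage engine where in-place updates apply):

// Growing update: without pre-allocation, this $push can take the
// document past its allocated space and force a move on disk.
db.collection('ref_history').update(
    { "_id": id },
    { "$push": { "dates": { "count": 1 } } }
);

// In-place update: the element already exists and an integer $inc does
// not change the document's size, so no move is required.
db.collection('ref_history').update(
    { "_id": id },
    { "$inc": { "dates.5.count": 1 } }
);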

Honestly the best way to handle such an operation would be to do an "upsert" initially with all the array elements allocated, and then only update the required element in position. This reduces to "two" potential writes, and you can further reduce them to a single "over the wire" operation using the Bulk API methods:

var id = _ref + "|" + year + "|" + month;
var bulk = db.collection('ref_history').initializeOrderedBulkOp();

bulk.find({ "_id": id }).upsert().updateOne({
    "$setOnInsert": {
        "dates": Array.apply(null,Array(32)).map(function(el) { return { "count": 0 }})
    }
});

var inc = {};
inc['dates.' + day + '.count'] = 1;
var update = { "$inc": inc };
bulk.find({ "_id": id }).updateOne(update);

bulk.execute(function(err,results) {
   // results would show what was modified or not
});

Or, since newer drivers favour consistency with one another, the "Bulk" parts have been relegated to regular arrays of WriteOperations instead:

var inc = {};
inc['dates.' + day + '.count'] = 1;
var update = { "$inc": inc };

db.collection('ref_history').bulkWrite([
    { "updateOne": {
        "filter": { "_id": id },
        "update": {
            "$setOnInsert": {
                "dates": Array.apply(null,Array(32)).map(function(el) {
                    return { "count": 0 }
                })
            }
        },
        "upsert": true
    }},
    { "updateOne": {
        "filter": { "_id": id },
        "update": update
    }}
],function(err,result) {
    // same thing as above really
});

In either case the $setOnInsert as the sole block will only do anything if an "upsert" actually occurs. The main gain is that the only contact with the server is a single request and response, as opposed to "back and forth" operations waiting on network communication.
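
If you want to verify which branch actually fired, the BulkWriteResult returned by .execute() exposes counters (a sketch; property names as in the 2.x node driver's legacy Bulk API):

bulk.execute(function(err, results) {
    if (err) throw err;
    // nUpserted is 1 only on the first run, when the document is created;
    // the $inc from the second operation shows up in nModified.
    console.log(results.nUpserted);
    console.log(results.nModified);
});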

This is typically what "Bulk" operations are for. They reduce network overhead when you might as well send a batch of requests to the server. The result significantly speeds things up, and neither operation really depends on the other, with the exception of "ordered" execution, which is the default in the latter case and is explicitly set by the legacy .initializeOrderedBulkOp().
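
For what it's worth, .bulkWrite() also accepts an options document where that behaviour can be set explicitly. A sketch, with ops standing in for the operations array shown above (true is already the default, so this is purely illustrative):

db.collection('ref_history').bulkWrite(ops, { ordered: true }, function(err, result) {
    // ordered execution matters here, since the $setOnInsert upsert
    // must be applied before the $inc touches the same document
});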

Yes, there is a "little" overhead in the "upsert", but it is "less" than testing with .count() and waiting for that result first.


N.B. Not sure about the 32 array entries in your listing. You possibly meant 24, but copy/paste got the better of you. At any rate, there are better ways to do that than hardcoding, as the examples demonstrate.

  • I meant it to be 31 entries (for the max days in a month), but I'm switching it to an object with dates as keys instead of an array, so that I can start at 1 instead of at 0. – Daniel Mar 03 '16 at 16:11 (a sketch of that variant appears after these comments)
  • @Daniel Which is also valid, but of course unless you again "pre-allocate" the space there is the potential for the document to move again. Whatever your actual end implementation is, the best thing to do is "upsert" rather than wait for a return from a count. That was your question and therefore that is what the answer says. Even if you resolved to only attempt the full write **once**, and even if you deleted all keys/array-entries afterwards, you still need to decide how to write the new document. Upserts trump `.find()` then `.insert()` every time. – Blakes Seven Mar 03 '16 at 21:15
  • I'm finding that in my tests the fastest is not to pre-allocate at all, and the find method seems to outperform the bulk insert by a slim margin. It is likely that there's something with my setup causing this, but I'm going to defer performance tuning for the time being. – Daniel Mar 03 '16 at 22:54
  • @Daniel It's very easy to be subjective and not truly test all cases. Yes, it's true that not all growth results in a document being moved, which is the main cost when it occurs. So if your documents never grow beyond the initial default allocation then it would not be necessary to add all keys. But it still does not change the fact that you need to create a new document at some time, and unless you are sure that you will always create before any data is written, the initial "upsert" is still the way to go. – Blakes Seven Mar 03 '16 at 23:12
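
For reference, a rough sketch of the object-keyed variant Daniel describes in the comments above. The shape is an assumption based on his description, not code from the thread:

// pre-allocate keys "1" through "31" with zeroed counters on insert
var dates = {};
for (var d = 1; d <= 31; d++) {
    dates[d] = { "count": 0 };
}

var inc = {};
inc['dates.' + day + '.count'] = 1;

db.collection('ref_history').bulkWrite([
    { "updateOne": {
        "filter": { "_id": id },
        "update": { "$setOnInsert": { "dates": dates } },
        "upsert": true
    }},
    { "updateOne": {
        "filter": { "_id": id },
        "update": { "$inc": inc }
    }}
], function(err, result) {
    // same pattern as the array version, just keyed by day of month
});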