
Tags: node.js, javascript

I am performing large-scale data processing via an async operation that performs hash lookups for a set of keys:

var async = require('async');

function dataProcess(keys, callback) {

    // PERFORMANCE TRACKING VARIABLES
    var iteratorCounter = 0;
    var timestampInitial = Date.now();
    var timestampLast = timestampInitial;
    // PERFORMANCE TRACKING VARIABLES

    async.mapLimit(keys, 100, iterator, function(err, result) {
        if (err) {
            callback(err, null);
        } else {
            // strip out the null records produced by iterator
            var returnArray = result.filter(function(n) {
                return n;
            });
            callback(null, returnArray);
        }
    });

    function iterator(key, iteratorCallback) {
        // lookup key in hash to retrieve a record ...
        // if non-null, do some light data-processing and return it
        var record; // record produced by the lookup / processing above

        // BEGIN PERFORMANCE TRACKING CODE
        iteratorCounter++;
        if (iteratorCounter % 1000 === 0) {
            var timestampNow = Date.now();
            var timeElapsedBatch = timestampNow - timestampLast;
            var timeElapsedTotal = timestampNow - timestampInitial;
            timestampLast = timestampNow;
            console.log('iterator count = ' + iteratorCounter);
            console.log('elapsed time this batch = ' + timeElapsedBatch);
            console.log('elapsed time total = ' + timeElapsedTotal);
        }
        // END PERFORMANCE TRACKING CODE

        setImmediate(iteratorCallback, null, record);
    }
}

I have added tracking code to enable performance testing.

I am running perf tests on sets of 100K+ keys. Overall performance is unacceptable: roughly 10 minutes to process ~100K keys.

Additionally, I am noticing significant non-linear performance degradation near the end of processing. For most of the run, per-batch processing time is relatively constant, varying between ~4 and ~7 seconds per batch of 1K keys. However, the final batches of 1K keys follow an erratic pattern, slowing sharply before tapering off: ~4 s => ~0 s => ~72 s => ~220 s => ~165 s => ~0 s.

My hunch is that this final degradation may be due in part to the async map operation having to push records into an array that has grown large and costly to resize.

What optimizations are available?

BaltoStar
  • What do you need an array with 100K keys for? – Bergi Jul 26 '14 at 09:00
  • What's the synchronous `filter` operation that is done in the end good for? Can't you do that asynchronously as well? – Bergi Jul 26 '14 at 09:01
  • @Bergi the key collection is processed to construct an array of object records that is passed on to another stage in the data-processing pipeline (bit difficult to explain the total context). The `result.filter` operation strips out any null objects returned by `iterator` -- since this operation occurs after `async.mapLimit` has completed processing of all keys (it collapses all `iterator` results), how would I refactor it to be asynchronous? – BaltoStar Jul 26 '14 at 12:38
  • You'd manage the array yourself and use [`eachLimit`](https://github.com/caolan/async#eachlimitarr-limit-iterator-callback) for the iteration, pushing to the result only when necessary (see the sketch after these comments). I don't know whether the number of to-be-filtered items in your sample data is significant, of course. – Bergi Jul 26 '14 at 13:24
  • What exactly do you mean by "*final 3 batches of 1K keys*"? The iterator is not running any batches. Or do you mean that `keys` is an array that contains 1K-key-batches? – Bergi Jul 26 '14 at 13:27
  • @Bergi I've edited my question to better explain how I am tracking performance. – BaltoStar Jul 26 '14 at 17:02
  • Uh, is the lookup and the "light data processing" actually asynchronous? If not, I don't see much reason to use the `async` module. Also, what else is going on at the server while you run this? – Bergi Jul 26 '14 at 18:59
  • No, there is nothing inherently asynchronous about the hash lookup and data-cleaning operations. For some reason I was under the impression that the `async` lib could speed up this type of repetitive operation, but I could be mistaken. The server is a RESTful service that exposes multiple APIs, and therefore, because the `dataProcess()` function is long-running (~10 mins), it cannot block handling of other requests to various endpoints -- but I suppose providing a callback satisfies that requirement ... ? – BaltoStar Jul 27 '14 at 03:50
  • Yeah, you're [mistaken about that](http://stackoverflow.com/q/21631241/1048572). It's not about "a callback", but about using `setImmediate`. Which could get a bit slow if called 100K+ times. – Bergi Jul 27 '14 at 09:19
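
Below is a minimal sketch pulling together the suggestions from these comments. It assumes the hash lookup and the light data processing are fully synchronous (as established above), so it drops the `async` module entirely: it builds the results array manually, pushing only non-null records (which removes the final `filter` pass), and yields to the event loop with a single `setImmediate` per chunk of keys rather than per key. `lookup` and `clean` are hypothetical stand-ins for the hash lookup and the light data-processing step.

// Process keys in synchronous chunks; yield between chunks so other
// requests to the server are still serviced during the long run.
// lookup() and clean() are hypothetical placeholders for the real work.
function dataProcess(keys, callback) {
    var results = [];              // managed manually; no filter pass needed
    var index = 0;
    var CHUNK_SIZE = 1000;

    function processChunk() {
        var end = Math.min(index + CHUNK_SIZE, keys.length);
        for (; index < end; index++) {
            var record = lookup(keys[index]);  // synchronous hash lookup
            if (record) {
                results.push(clean(record));   // keep only non-null records
            }
        }
        if (index < keys.length) {
            setImmediate(processChunk);        // one yield per chunk of 1K keys
        } else {
            callback(null, results);           // done: hand off the array
        }
    }

    processChunk();
}

This trades 100K+ `setImmediate` calls for ~100, which is the point of the last comment above; whether chunks of 1K keys keep the server responsive enough depends on how long one chunk actually takes to process.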

0 Answers