
I am working with zombie.js to scrape a site, and I must use the callback style to connect to each URL. The point is that I have an array of URLs and I need to process each one with an async function. This is my first approach:

var urls = ["http...", "http..."];
function process_url(index)
{
   if(index == urls.length)
      return;

   async_function(urls[index],
                  function() { 
                        ... 
                        //parse the url 
                        ...
                        // Process the next url
                        process_url(index + 1);
                       }
                   );
}

process_url(0);

Without using any third-party Node.js library to run the async function as if it were synchronous or to wait for it (wait.for, synchronize, mocha), this is the way I thought to solve the problem. I don't know what would happen if the array is too big. Is each function released from memory when the next one is called, or do all the functions stay in memory until the end?

Any ideas?

dlopezgonzalez

1 Answer


Your scheme will work. I call it "manually sequencing async operations".

A general purpose version of what you're doing would look like this:

var request = require('request');   // the request module, as assumed in the comment below

function processItem(data, callback) {
    // do your async function here
    // for example, let's suppose it was an http request using the request module
    request(data, callback);
}

function processArray(array, fn) {
    var index = 0;

    function next() {
        if (index < array.length) {
            fn(array[index++], function(err, result) {
                // process error here
                if (err) return;
                // process result here
                next();
            });
        }
    }
    next();
}

processArray(arr, processItem);
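
Note that processArray() above gives you no signal when the whole array is finished. If you need that, here's a minimal sketch of a variant with a completion callback (processArrayDone is just an illustrative name, not a standard function):

function processArrayDone(array, fn, done) {
    var index = 0;

    function next(err) {
        // stop on the first error or when the array is exhausted
        if (err || index >= array.length) {
            return done(err);
        }
        fn(array[index++], function(err, result) {
            // results could be collected here before moving on
            next(err);
        });
    }
    next();
}

processArrayDone(arr, processItem, function(err) {
    // all items are done here (or err is set if one failed)
});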

As to your specific questions:

I don't know what would happen if the array is too big. Is each function released from memory when the next one is called, or do all the functions stay in memory until the end?

Memory in Javascript is released when it is no longer referenced by any running code and when the garbage collector gets time to run. Since you are running a series of asynchronous operations here, the garbage collector likely gets a chance to run regularly while waiting for the http response from the async operation, so memory can get cleaned up then. Functions are just another type of object in Javascript and they get garbage collected like anything else. When they are no longer referenced by running code, they are eligible for garbage collection.

In your specific code, because you are re-calling process_url() only in an async callback, there is no stack build-up (as in normal recursion). The prior instance of process_url() has already completed BEFORE the async callback is called and BEFORE you call the next iteration of process_url().
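
You can see the same effect with any async primitive. In this sketch (using the built-in setImmediate rather than your actual async_function), each "recursive" call starts from an empty stack:

function countDown(n) {
    if (n === 0) return;
    setImmediate(function() {
        // countDown(n) has already returned by the time this callback runs,
        // so the stack does not grow no matter how many iterations happen
        countDown(n - 1);
    });
}

countDown(1000000);   // completes without a stack overflow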


In general, management and coordination of multiple async operations is much, much easier using promises, which are built into current versions of node.js and are part of the ES6 ECMAScript standard. No external libraries are required to use promises in current versions of node.js.

For a list of a number of different techniques for sequencing your asynchronous operations on your array, both using promises and not using promises, see:

How to synchronize a sequence of promises?

The first step in using promises is to "promisify" your async function so that it returns a promise instead of taking a callback.

function async_function_promise(url) {
    return new Promise(function(resolve, reject) {
        async_function(url, function(err, result) {
            if (err) {
                reject(err);
            } else {
                resolve(result);
            }
        });
    });
}

Now, you have a version of your function that returns promises.
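
For example, you could use it by itself like this:

async_function_promise("http...").then(function(result) {
    // parse the result here
}, function(err) {
    // handle the error here
});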

If you want your async operations to proceed one at a time so the next one doesn't start until the previous one has completed, then a usual design pattern for that is to use .reduce() like this:

function process_urls(array) {
    return array.reduce(function(p, url) {
        return p.then(function(priorResult) {
            return async_function_promise(url);
        });
    }, Promise.resolve());
}

Then, you can call it like this:

var myArray = ["url1", "url2", ...];
process_urls(myArray).then(function(finalResult) {
    // all of them are done here
}, function(err) {
    // error here
});
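
Note that this .reduce() version discards the individual results. If you also want those, here's a sketch of one way to accumulate them in order (process_urls_collect is just an illustrative name):

function process_urls_collect(array) {
    var results = [];
    return array.reduce(function(p, url) {
        return p.then(function() {
            return async_function_promise(url).then(function(result) {
                // save each result in request order
                results.push(result);
            });
        });
    }, Promise.resolve()).then(function() {
        // resolve with the array of all the results
        return results;
    });
}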

There are also Promise libraries that have some helpful features that make this type of coding simpler. I, myself, use the Bluebird promise library. Here's how your code would look using Bluebird:

var Promise = require('bluebird');
var async_function_promise = Promise.promisify(async_function);

function process_urls(array) {
    return Promise.map(array, async_function_promise, {concurrency: 1});
}

process_urls(myArray).then(function(allResults) {
    // all of them are done here and allResults is an array of the results
}, function(err) {
    // error here
});

Note, you can change the concurrency value to whatever you want here. For example, you would probably get faster end-to-end performance if you increased it to something between 2 and 5 (the best value depends upon the server implementation).
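
For example, allowing up to three requests in flight at a time is only a one-value change to the sketch above:

function process_urls(array) {
    // up to 3 requests in flight at once; tune this for the target server
    return Promise.map(array, async_function_promise, {concurrency: 3});
}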

jfriend00