
Recently I made a web scraper in Node.js using promises. I created a Promise for each URL I wanted to scrape and then used the all method:

var fetchUrlArray = [];
for (...) {
    var mPromise = new Promise(function (resolve, reject) {
        http.get(..., resolve).on("error", reject);
    });
    fetchUrlArray.push(mPromise);
}
Promise.all(fetchUrlArray).then(...);

There were thousands of URLs, but only a few of them timed out. I got the impression that it was handling 5 promises in parallel at a time. My question is: how exactly does Promise.all() work? Does it:

  • call each promise one by one, switching to the next only when the previous one has resolved,
  • process the promises in batches of a few from the array, or
  • fire all the promises at once?

What is the best way to solve this problem in Node.js? Because as it stands, I can solve this problem far faster in Java/C#.

Light
3 Answers


I would do it like this

Personally, I'm not a big fan of Promises. I think the API is extremely verbose and the resulting code is very hard to read. The method defined below results in very flat code, and it's much easier to immediately understand what's going on. At least in my opinion.

Here's a little thing I created for an answer to this question

// void asyncForEach(Array arr, Function iterator, Function callback)
//   * iterator(item, done) - done can be called with an err to shortcut to callback
//   * callback(err)        - receives an error if an iterator sent one
function asyncForEach(arr, iterator, callback) {

  // create a cloned queue of arr
  var queue = arr.slice(0);

  // create a recursive iterator
  function next(err) {

    // if there's an error, bubble to callback
    if (err) return callback(err);

    // if the queue is empty, call the callback with no error
    if (queue.length === 0) return callback(null);

    // call the callback with our task
    // we pass `next` here so the task can let us know when to move on to the next task
    iterator(queue.shift(), next);
  }

  // start the loop
  next();
}

You can use it like this

var urls = [
  "http://example.com/cat",
  "http://example.com/hat",
  "http://example.com/wat"
];

function eachUrl(url, done){
  http.get(url, function(res) {
    // do something with res
    done();
  }).on("error", function(err) {
    done(err);
  });
}

function urlsDone(err) {
  if (err) throw err;
  console.log("done getting all urls");
}

asyncForEach(urls, eachUrl, urlsDone);

Benefits of this

  • no external dependencies or beta apis
  • reusable on any array you want to perform async tasks on
  • non-blocking, just as you've come to expect with node
  • could be easily adapted for parallel processing
  • by writing your own utility, you better understand how this kind of thing works

If you just want to grab a module to help you, look into async and the async.eachSeries method.

Mulan
  • This is much slower and more error prone than the promise version. It doesn't have working stack traces, and it does not compose well or propagate errors. Moreover, if you forget to call `done` explicitly in even one place, the app might hang and you might not know why. Promises solve all of this. In addition - _properly_ promisified code would look a lot nicer. – Benjamin Gruenbaum Oct 21 '14 at 16:24
  • What are you talking about? "Slower"? By what measure? The stack traces are perfectly legible, and the errors can be handled in the `done` handler. Why would you "forget" to call `done`? It's an async piece of code. If anything, a more explicit requirement is better than an implicit one. – Mulan Oct 21 '14 at 22:58
  • Oh I remember you, Promises Evangelist. Thanks for your subjective review of my code. – Mulan Oct 21 '14 at 23:01

What you pass Promise.all() is an array of promises. It knows absolutely nothing about what is behind those promises. All it knows is that those promises will get resolved or rejected sometime in the future, and it will create a new master promise that resolves once every promise you passed it has resolved (or rejects as soon as any one of them rejects).

This is one of the nice things about promises: they are an abstraction that lets you coordinate any type of action (usually asynchronous) without regard for what type of action it is. As such, promises have literally nothing to do with running the actual action. All they do is monitor the completion or error of the action and report that back to those following the promise. Other code actually runs the action.
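As a small illustration of that (using plain setTimeout stand-ins rather than real HTTP requests), Promise.all just aggregates the results, in input order, once every promise has settled:

```javascript
// A stand-in for an async action: resolves with `value` after `ms` milliseconds.
function delayed(value, ms) {
  return new Promise(function (resolve) {
    setTimeout(function () { resolve(value); }, ms);
  });
}

Promise.all([delayed("a", 30), delayed("b", 10), delayed("c", 20)])
  .then(function (results) {
    // Results arrive in input order, not in completion order.
    console.log(results); // [ 'a', 'b', 'c' ]
  });
```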

In your particular case, you are immediately calling http.get() in a tight loop and your code (nothing to do with promises) is launching a zillion http.get() operations at once. Those will get fired as fast as the underlying transport can do them (likely subject to connection limits).

If you want them to be launched serially or in batches of say 10 at a time, then you have to code it that way yourself. Promises have nothing to do with that.

You could use promises to help you code them to launch serially or in batches, but it would take extra code on your part either way to make that happen.
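A minimal sketch of what that extra code could look like (the names `mapWithLimit`, `limit`, and `mapper` are my own, not from any library): start `limit` tasks, and start another each time one finishes.

```javascript
// Run an async `mapper` over `items` with at most `limit` promises in flight.
function mapWithLimit(items, limit, mapper) {
  var results = [];
  var index = 0;

  function next() {
    if (index >= items.length) return Promise.resolve();
    var i = index++; // claim the next item for this worker chain
    return Promise.resolve(mapper(items[i])).then(function (value) {
      results[i] = value;   // keep results in input order
      return next();        // this slot is free again: start the next item
    });
  }

  // Kick off up to `limit` parallel chains, then wait for all of them.
  var workers = [];
  for (var k = 0; k < Math.min(limit, items.length); k++) {
    workers.push(next());
  }
  return Promise.all(workers).then(function () { return results; });
}
```

With something like `mapWithLimit(urls, 10, fetchUrl)`, at most 10 requests would be in flight at any moment.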

The Async library is built specifically for running things in parallel with a maximum number in flight at any given time. This is a common scheme, because you either have connection limits on your end or you don't want to overwhelm the receiving server. You may be interested in the async.parallelLimit method (or async.eachLimit for iterating an array), which runs a number of async operations in parallel while enforcing that cap.

jfriend00
  • Well written answer. Props. It's worth mentioning that while JavaScript runs like it's in a single thread actual I/O is performed asynchronously and on another thread in practice. It's also worth mentioning that you have a `{concurrency: X}` option in Bluebird. – Benjamin Gruenbaum Oct 21 '14 at 16:26

First, a clarification: A promise does represent the future result of a computation, nothing else. It does not represent the task or computation itself, which means it cannot be "called" or "fired".

Your script does create all those thousands of promises immediately, and each of those creations does call http.get immediately. I would suspect that the http library (or something it depends on) has a connection pool with a limit of how many requests to make in parallel, and defers the rest implicitly.

Promise.all does not do any "processing" - it is not responsible for starting the tasks or resolving the passed promises. It only listens to them, checks whether they all are settled, and returns a promise for that eventual combined result.
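You can see that the work starts at construction time, not when Promise.all is called, with a trivial experiment (a plain timer stands in for http.get here):

```javascript
var log = [];

// The executor function runs immediately, at `new Promise` time.
var p = new Promise(function (resolve) {
  log.push("started");
  setTimeout(function () { resolve("done"); }, 10);
});

log.push("before Promise.all");

// By the time we hand p to Promise.all, its work is already underway.
Promise.all([p]).then(function (results) {
  log.push(results[0]);
  console.log(log); // [ 'started', 'before Promise.all', 'done' ]
});
```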

Bergi