
I'm new to Node.js. I've been working my way through "Node.js the Right Way" by Jim R. Wilson and I'm running into a contradiction in the book (and in Node.js itself?) that I haven't been able to reconcile to my satisfaction with any amount of googling.

It's stated repeatedly in the book, and in other resources I've looked at online, that Node.js runs a callback in response to some event line by line until it completes, and only then does the event loop proceed to wait for, or invoke, the next callback. And because Node.js is single-threaded (and, short of explicitly using the cluster module, runs as a single process), my understanding is that at most one chunk of JavaScript code is ever executing at a time.
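For example, my mental model says that in a toy snippet like this (my own, not from the book), the loop must finish before the timer callback can ever run, even with a 0 ms delay:

setTimeout(function() {
  console.log('callback'); // can only run once the synchronous code finishes
}, 0);

for (let i = 0; i < 3; i++) {
  console.log('sync ' + i);
}
// always prints: sync 0, sync 1, sync 2, callback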

Am I understanding that correctly? Here's the contradiction (in my mind). How is Node.js so highly concurrent if this is the case?

Here is an example straight from the book that illustrates my confusion. It is intended to walk a directory of many thousands of XML files and extract the relevant bits of each into a JSON document.

First the parser:

'use strict';
const
  fs = require('fs'),
  cheerio = require('cheerio');

module.exports = function(filename, callback) {
  // read the whole file asynchronously; the callback below fires later,
  // once the data is available
  fs.readFile(filename, function(err, data) {
    if (err) { return callback(err); }
    let
      $ = cheerio.load(data.toString()),
      collect = function(index, elem) {
        return $(elem).text();
      };

    // extract the interesting fields from the parsed RDF/XML
    callback(null, {
      _id: $('pgterms\\:ebook').attr('rdf:about').replace('ebooks/', ''),
      title: $('dcterms\\:title').text(),
      authors: $('pgterms\\:agent pgterms\\:name').map(collect),
      subjects: $('[rdf\\:resource$="/LCSH"] ~ rdf\\:value').map(collect)
    });
  });
};

And the bit that walks the directory structure:

'use strict';
const
  file = require('file'),
  rdfParser = require('./lib/rdf-parser.js');

console.log('beginning directory walk');

file.walk(__dirname + '/cache', function(err, dirPath, dirs, files) {
  // for every file in this directory, kick off an asynchronous parse
  files.forEach(function(path) {
    rdfParser(path, function(err, doc) {
      if (err) {
        throw err;
      } else {
        console.log(doc);
      }
    });
  });
});

If you run this code, you get an error because the program exhausts all available file descriptors, which would seem to indicate that it has opened thousands of files concurrently.
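To convince myself it wasn't something specific to cheerio or the directory walker, I tried a stripped-down toy version (my own code, just re-reading this script itself thousands of times), and it seems to fail the same way:

'use strict';
const fs = require('fs');

// schedule a huge number of reads in one synchronous loop; none of the
// callbacks can run until this whole script has finished executing
for (let i = 0; i < 100000; i++) {
  fs.readFile(__filename, function(err) {
    if (err) { throw err; } // EMFILE once descriptors run out, on my machine
  });
}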

My question is... how can this possibly be, unless the event model and/or concurrency model behave differently than how they have been explained?

I'm sure someone out there knows this and can shed light on it, but for the moment, color me very confused!

Kent Rancourt
  • The question is so long. `const` should only be used for scalar values. – Ryan Jul 27 '14 at 20:57
  • [This](http://rickgaribay.net/archive/2012/01/28/node-is-not-single-threaded.aspx) is a pretty good resource if you like reading. – Goodzilla Jul 27 '14 at 20:58
  • @true, this is not my code. It's an example taken straight from a book and is meant to illustrate my actual question, which has nothing to do with const at all. Your comment isn't helpful or constructive in any way. – Kent Rancourt Jul 27 '14 at 21:01
  • I actually wasn't trying to be helpful. I was suggesting an improvement. – Ryan Jul 27 '14 at 21:03
  • Where exactly are you seeing the contradiction? – OrangeDog Jul 27 '14 at 21:08
  • @OrangeDog, the perceived contradiction is that Node.js does one thing at a time, but it has, in this case, thousands of files open for read concurrently. How can that be? – Kent Rancourt Jul 27 '14 at 21:15
  • It opened those thousands of files one at a time. Now it's reading from them and calling your callbacks one at a time. – OrangeDog Jul 28 '14 at 07:37

3 Answers


Am I understanding that correctly?

Yes.

How is Node.js so highly concurrent if this is the case?

It's not the JavaScript execution itself that is concurrent; the I/O (and other heavy work) is. When you call an asynchronous function, it starts the task (for example, reading a file) and returns immediately so that, as you put it, the next line of the script can run. Meanwhile the task continues in the background (concurrently); once it finishes, the callback assigned to it is placed on the event loop's queue, which invokes it with the now-available data.
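A trivial way to see that ordering (just a toy snippet, not specific to your code):

const fs = require('fs');

console.log('1: calling readFile');
fs.readFile(__filename, function(err, data) {
  // runs later, from the event loop, once the background read finishes
  if (err) { throw err; }
  console.log('3: got ' + data.length + ' bytes');
});
console.log('2: readFile returned immediately');

The two synchronous logs always print before the callback, no matter how small the file is.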

For details on this "in the background" processing, and how node actually manages to run all these asynchronous tasks in parallel, have a look at the question Nodejs Event Loop.

Bergi
  • I think the author's question is exactly what does the job after the "return immediately to 'run the next line of the script'" part. There is a task somewhere that needs to be done, but control isn't there to fulfill it. For me it's confusing too, unless I think in terms of subprocesses, threads, etc. – Attila Kling Jul 27 '14 at 21:20
  • It's starting to make sense. Would this be an accurate summary of what's going on? The for/each loop is sequentially scheduling (thousands of) I/O tasks that (unlike the JS) ARE executed concurrently. Because the single thread that's executing JS remains tied up with this task of scheduling these I/O tasks, Node's event loop is unable to invoke the handlers for any of the completed I/O reads and thus, all those files remain open. Am I close? – Kent Rancourt Jul 27 '14 at 21:23
  • @Jim-Y: Yes, there is a thread pool in the background, but it shouldn't really matter to you. – Bergi Jul 27 '14 at 21:34
  • @KentRancourt: Yes, that's what happens. However, notice that the directory walk is asynchronous as well, each loop over a directory runs in its own callback - with any waiting events possibly interleaving before the next subdirectory is iterated. So you'd need a quite flat tree with very very many files to really schedule all reads at the same time. – Bergi Jul 27 '14 at 21:39
2

This is a pretty simple description, and skips a lot of things.

files.forEach is not asynchronous. The code therefore goes through the list of files in the directory, calling fs.readFile on each one, then returns to the event loop.

The loop then has a load of file-open events to process, which in turn queue up file-read events. Only then can the loop start going through and calling the callbacks to fs.readFile with the data that's been read. These can only be called one at a time: as you say, there's only one thread executing JavaScript at any one time.

However, before any of these callbacks are called, you've already opened every file in that original list, leading to file handle exhaustion if there were too many.
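If it helps, here's a rough sketch of the usual workaround (my own, not from the book; the helper name and limit are made up): cap how many reads are in flight, and start a new one only as each finishes. It assumes the `files` array and `rdfParser` from your code:

function walkLimited(paths, limit) {
  let next = 0;

  function readNext() {
    if (next >= paths.length) { return; }
    rdfParser(paths[next++], function(err, doc) {
      if (err) { throw err; }
      console.log(doc);
      readNext(); // this read's descriptor is free again; start another
    });
  }

  // prime `limit` concurrent reads; each chains to the next path as it completes
  for (let i = 0; i < limit && i < paths.length; i++) {
    readNext();
  }
}

walkLimited(files, 10); // e.g. at most ~10 files open at once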

OrangeDog

I think OrangeDog's answer is the correct answer to your specific question. But maybe you'll also find this short and awesome presentation by Philip Roberts, "What the heck is the event loop anyway?", helpful; it explains the event loop and JavaScript's asynchronous processing really nicely. Note that the video isn't Node.js-specific: these principles apply to all JavaScript code.

Sabacc
  • Maybe I should undelete it then? I didn't think it was very good. – OrangeDog Jul 28 '14 at 07:36
  • I didn't notice you deleted your answer. I think your answer was correctly answering the question (although Bergi's answer is already accepted now). – Sabacc Jul 28 '14 at 09:31
  • @OrangeDog, I would have accepted your answer, but as Sabacc noted, it disappeared for a while and Bergi's answer was quite helpful in the meanwhile. I have given your answer an upvote, though, and want to sincerely thank you for your help! – Kent Rancourt Jul 28 '14 at 17:23