
I am playing with Node.js and I have created a simple script that uploads files from a directory to a server:

var request = require('request');
var file = require('file');
var fs = require('fs');
var path = require('path');


var VERSION = '0.1';
var CONFIG_FILE = path.join(__dirname, 'etc', 'sender.conf.json');


var config = JSON.parse(
  fs.readFileSync(CONFIG_FILE).toString()
);

var DATA_DIR = __dirname;
config['data_dir'].forEach(function(dir) {
  DATA_DIR = path.join(DATA_DIR, dir);
});


console.log('sending data from root directory: ' + DATA_DIR);
file.walk(
  DATA_DIR,
  function(err, dir_path, dirs, files) {
    if(err) {
      return console.error(err);
    } 
    sendFiles(dir_path, files);
  } 
);

function sendFiles(dir_path, files)
{
  files
    .filter(function(file) {
      return file.substr(-5) === '.meta';
    })
    .forEach(function(file) {
      var name = path.basename(file.slice(0, -5));
      sendFile(dir_path, name);
    })
  ; 
} 

function sendFile(dir_path, name)
{
  console.log("reading file start: " + dir_path + "/" + name);
  fs.readFile(
    path.join(dir_path, name + '.meta'),
    function(err, raw_meta) {
      if(err) {
        return console.error(err);
      }
      console.log("reading file done: " + dir_path + "/" + name);
      sendData(
        name,
        JSON.parse(raw_meta),
        fs.createReadStream(path.join(dir_path, name + '.data'))
      );
    }
  );
  console.log("reading file async: " + dir_path + "/" + name);
}

function sendData(name, meta, data_stream)
{ 
  meta['source'] = config['data_source'];

  var req = request.post(
    config['sink_url'],
    function(err, res, body) {
      if(err) {
        console.log(err);
      }
      else {
        console.log(name);
        console.log(meta);
        console.log(body);
      }
    }
  );
  var form = req.form();

  form.append(
    'meta',
    JSON.stringify(meta),
    { 
      contentType: 'application/x-www-form-urlencoded'
    }
  );

  form.append(
    'data',
    data_stream
  );
}

It works fine when run with only a few files, but when I run it on a directory with lots of files, it chokes. This is because it keeps creating huge numbers of tasks for reading from files, but never gets around to actually doing the reading (there are too many files). This can be observed in the output:

sending data from root directory: .../data
reading file start: .../data/ac/ad/acigisu-adruire-sabeveab-ozaniaru-fugeef-wemathu-lubesoraf-lojoepe
reading file async: .../data/ac/ad/acigisu-adruire-sabeveab-ozaniaru-fugeef-wemathu-lubesoraf-lojoepe
reading file start: .../data/ac/ab/acodug-abueba-alizacod-ugvut-nucom
reading file async: .../data/ac/ab/acodug-abueba-alizacod-ugvut-nucom
reading file start: .../data/ac/as/acigisu-asetufvub-liwi-ru-mitdawej-vekof
reading file async: .../data/ac/as/acigisu-asetufvub-liwi-ru-mitdawej-vekof
reading file start: .../data/ac/av/ace-avhad-bop-rujan-pehwopa
reading file async: .../data/ac/av/ace-avhad-bop-rujan-pehwopa
...

For each file, there is a "reading file start" message produced immediately before the call to fs.readFile, and a "reading file async" message produced immediately after the asynchronous read has been scheduled. But there is no "reading file done" message even when I let it run for a long time, which means that the reading of any file has probably never even been scheduled (those files are on the order of hundreds of bytes, so once scheduled, the reads would probably finish in a single go).

This leads me to the following thought process. Async calls in Node.js are made because the event loop itself is single-threaded and we do not want to block it. However, once this requirement is satisfied, does it make any sense to nest further async calls into async calls that are themselves nested in async calls, and so on? Would it serve any particular purpose? Moreover, would it not be an actual pessimization of the code due to scheduling overhead that is not really needed and could be completely avoided if the complete handling of a single file consisted of synchronous calls only?

Given the thought process above, my course of action would be to use the solution from this question (a rough sketch follows the list below):

  • asynchronously push the names of all files to async.queue
  • limit the number of parallel tasks by setting queue.concurrency
  • provide a file-upload handler that is completely synchronous, i.e. it synchronously reads the contents of the file and, after that is finished, synchronously sends the POST request to the server
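
A minimal sketch of what I have in mind (this assumes the async package; sendFileSync is a hypothetical handler that would do the synchronous read and the synchronous POST described above):

var async = require('async');

// tasks are processed by a completely synchronous handler,
// at most queue.concurrency of them at a time
var queue = async.queue(function(task, done) {
  sendFileSync(task.dir_path, task.name);  // hypothetical synchronous read + POST
  done();
}, 10);  // queue.concurrency

// the directory walker would then only push names onto the queue:
// queue.push({ dir_path: dir_path, name: name });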

This is my very first attempt to use Node.js and/or JavaScript, so it is quite possible I am completely wrong (note that e.g. the sync-request package makes it very clear that synchronous calls are not desirable, which contradicts my thought process above - the question is why). Any comments on the validity of the above thought process, as well as the viability of the proposed solution and possible alternatives to it, would be very much appreciated.

  • Thinking about it a bit more, my proposed solution does not really solve the problem: I need to limit adding files to the queue, not processing files (what would be the best way to do that?). Still, I would keep the file-handler synchronous (called asynchronously, but not spawning any more asynchronous tasks by itself). – mareq Apr 20 '17 at 12:02

1 Answer


== Update ==

There is a very good article explaining all of this in great detail directly in the documentation of Node.js.

As for the particular problem at hand, it is indeed in the choice of the file-system walker module. The solution is to use e.g. walk instead of file:

@@ -4,7 +4,7 @@


 var request = require('request');
-var file = require('file');
+var walk = require('walk');
 var fs = require('fs');
 var path = require('path');

@@ -24,13 +24,19 @@ config['data_dir'].forEach(function(dir) {


 console.log('sending data from root directory: ' + DATA_DIR);
-file.walk(
-  DATA_DIR,
-  function(err, dir_path, dirs, files) {
-    if(err) {
-      return console.error(err);
-    }
-    sendFiles(dir_path, files);
+var walker = walk.walk(DATA_DIR)
+walker.on(
+  'files',
+  function(dir_path, files, next) {
+    sendFiles(dir_path, files.map(function(stats) { return stats.name; }));
+    next();
+  }
+);
+walker.on(
+  'errors',
+  function(dir_path, node_stats, next) {
+    console.error('file walker:', node_stats);
+    next();
   }
 );

== Original Post ==

After a bit more study, I will attempt to answer my own question. This answer is still only a partial solution (a more complete answer from someone who has actual experience with Node.js would be very much appreciated).

The short answer to the main question above is that it is indeed not only desirable, but also almost always necessary, to schedule more asynchronous functions from already asynchronous functions. The long explanation follows.

This is because of how Node.js scheduling works: "Everything runs on a different thread except our code." There are two very important comments in the discussion below the linked blog post (a small experiment after the list illustrates both):

  • "Javascript always finishes the currently executing function first. An event will never interrupt a function." [Twitchard]
  • "Also note it won't just finish the current function, it will run to completion of all synchronous functions and I believe anything queued with process.nextTick... before the request callback is handled." [Tim Oxley]

There is also a note mentioning this in the documentation of process.nextTick: "The next tick queue is completely drained on each pass of the event loop before additional I/O is processed. As a result, recursively setting nextTick callbacks will block any I/O from happening, just like a while(true); loop."
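
The quoted warning can be reproduced with a few lines (again just an illustration of the documented behaviour, not code from the script):

var fs = require('fs');

fs.readFile(__filename, function() {
  console.log('reading file done');  // never printed
});

(function spin() {
  // the next tick queue never drains, so the read above is never serviced,
  // which mirrors the missing "reading file done" messages in the question
  process.nextTick(spin);
})();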

So, to summarize: all the code of the script itself runs on a single thread and a single thread only. The asynchronous callbacks scheduled to run are executed on that very same single thread, and they are executed only after the whole current next tick queue has been drained. The use of asynchronous callbacks provides the only point at which some other function can be scheduled to run. If the file-upload handler did not schedule any additional asynchronous tasks as described in the question, its execution would block everything else until the whole file-upload handler had finished. That is not desirable.

This also explains why the actual reading of the input files never occurs ("recursively setting nextTick callbacks will block any I/O from happening" - see above). It would eventually occur after all the tasks for the whole traversed directory hierarchy had been scheduled. However, without further study, I am not able to answer the question of how to limit the number of scheduled file-upload tasks (effectively the size of the task queue) and block the scheduling loop until some of those tasks have been processed (until some room on the task queue has been freed). Hence this answer is still incomplete.
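
One direction that might be worth exploring (an untested sketch on my side, relying on my understanding that the walk module does not descend further until next() is called): defer next() until the uploads for the current directory have finished, e.g. via a hypothetical sendFilesAsync helper that invokes its callback once all of them are done:

var walker = walk.walk(DATA_DIR);
walker.on('files', function(dir_path, files, next) {
  // hypothetical helper: uploads all files of this directory and calls back
  // when finished; only then is the walker allowed to continue, so the number
  // of pending tasks stays bounded by the size of a single directory
  sendFilesAsync(dir_path, files, function() {
    next();
  });
});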

  • Would this be fixed by replacing the call to [`return walk(dirs, walker, mode, callback);`](https://github.com/aconbere/node-file-utils/blob/72dc5839ec7553e65cc72b6c77f89dbfb7f430e5/lib/file.js#L62) with a call to [`setImmediate`](http://stackoverflow.com/questions/17502948/nexttick-vs-setimmediate-visual-explanation)? (It would not be possible to have the return value, but that is a different issue.) – mareq Apr 22 '17 at 14:43