
I don't even know how to ask this question properly, but I have concerns about the performance (mostly memory consumption) of the following code. I anticipate that this code will consume a lot of memory because of `map` over a large array and the many 'hanging' functions that wait for an external service. Are my concerns justified here? What would be a better approach?

var fs = require('fs'); // `request` and `v` are an HTTP client and a name-correction web service

var list = fs.readFileSync('./mailinglist.txt', 'utf8') // say 1,000,000 records
  .split("\n")
  .map( processEntry );

function processEntry(i){
  i = i.split('\t');
  getEmailBody( { "name": i[0] }, function(emailBody, name){
    var msg = {
      "message" : emailBody,
      "name" : i[0]
    };
    request(msg, function reqCb(err, result){
      // ...
    });
  }); // getEmailBody
}

function getEmailBody(obj, cb){
  // read email template from file;
  // v() returns the correct form of the person's name via a web-based service
  v(obj.name, function(correctedName){
    cb(obj, correctedName);
  });
}
Maciej Jankowski
  • If you're concerned about fitting everything into memory, you may be interested in [`fs.createReadStream`](http://nodejs.org/api/fs.html#fs_fs_createreadstream_path_options), which returns a [`stream.Readable`](http://nodejs.org/api/stream.html#stream_class_stream_readable). That would allow you to process the file in chunks (a rough sketch of this approach is shown after these comments). – apsillers Sep 16 '14 at 12:31
  • `getEmailBody` seems to be asynchronous. How do you want to `map` with that at all? – Bergi Sep 16 '14 at 13:06
  • Isn't map() lazy? And isn't there a function to read a file line by line in Node? – Adrian Kalbarczyk Sep 16 '14 at 14:17
  • @apsillers - chunks - cool, but then wouldn't it mean jumping through more hoops to split it into lines? – Maciej Jankowski Sep 16 '14 at 16:03
  • @Bergi - that is the question - `map()` would do its job, leaving me with a million async callbacks pending - yes? no? – Maciej Jankowski Sep 16 '14 at 16:04
  • @AdrianKalbarczyk I think it is lazy, but in this case the functions have callbacks which I believe would end up hanging until they receive data from the web service `v()`. There is no function to read line by line. – Maciej Jankowski Sep 16 '14 at 16:06
  • @MaciejJankowski: No, it wouldn't, because async functions return `undefined` and you'd get back an array of undefineds. Of course, you also need to do some throttling so that not a million requests are sent at once, DOSing the service. – Bergi Sep 16 '14 at 16:09
  • @Bergi but what about pending callbacks when the function returns - aren't they still pending? am I getting mindfscked with Javascript again? – Maciej Jankowski Sep 16 '14 at 16:17
  • @MaciejJankowski: Yes, they're still pending and will be executed at some point in the future. A pending callback doesn't stop the function from returning though - that's the basic concept of asynchrony and concurrency in JS. – Bergi Sep 16 '14 at 16:26
  • @Bergi good. I was not concerned about map returning undefineds, rather about the memory taken by those pending callbacks, which would actually do the operations I need done on the list. Anyway - thanks for pointing me in the right direction - I will test it out. – Maciej Jankowski Sep 16 '14 at 16:29
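
To make the stream suggestion above concrete: a rough, illustrative sketch (not from the original discussion) of reading the file with `fs.createReadStream` and splitting the chunks into lines by hand, so that only a small window of the file is held in memory at any time. The `forEachLine` helper and the `processLine` callback are hypothetical names.

var fs = require('fs');

// Illustrative sketch: stream the file and call processLine for each complete line.
function forEachLine(path, processLine) {
  var leftover = '';
  var stream = fs.createReadStream(path, { encoding: 'utf8' });

  stream.on('data', function (chunk) {
    var parts = (leftover + chunk).split('\n');
    leftover = parts.pop(); // the last piece may be an incomplete line
    parts.forEach(processLine);
  });

  stream.on('end', function () {
    if (leftover) processLine(leftover); // flush the final line
  });
}

forEachLine('./mailinglist.txt', function (line) {
  // e.g. feed the line into processEntry, ideally with some throttling
});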

2 Answers


If you're worried about submitting a million HTTP requests in a very short time span (which you probably should be), you'll have to set up a buffer of some kind.

One simple way to do it:

var lines = fs.readFileSync('./mailinglist.txt', 'utf8').split("\n");

var entryIdx = 0;
var done = false;

var processNextEntry = function () {
    if (entryIdx < lines.length) {
        processEntry(lines[entryIdx++]);
    } else {
        done = true;
    }
};

var processEntry = function _processEntry(i){
    i = i.split('\t');
    getEmailBody( { "name": i[0] }, function(emailBody, name){
        var msg = {
            "message" : emailBody,
            "name" : name
        };
        request(msg, function reqCb(err, result){
            // ...
            !done && processNextEntry();
        });
    }); // getEmailBody
}

// getEmailBody didn't change

// you set the ball rolling by calling processNextEntry n times,
// where n is a sensible number of http requests to have pending at once.

for (var i=0; i<10; i++) processNextEntry();

Edit: according to this blog post, Node has an internal queue system and will only allow 5 simultaneous requests. But you can still use this method to avoid filling up that internal queue with a million items if you're worried about memory consumption.
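
For what it's worth, the queue described there is presumably the default `http.Agent` socket pool, which was capped at 5 concurrent sockets per host in Node versions of that era. A minimal sketch of raising that limit, if you do want more requests in flight at once (an assumption, not something the answer prescribes):

var http = require('http');

// Older Node versions defaulted to 5 concurrent sockets per host;
// raising maxSockets lets more outgoing requests run in parallel.
http.globalAgent.maxSockets = 20;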

d.j.sheldrick

Firstly, I would advise against using `readFileSync` and instead favour the async equivalent. Blocking on I/O operations should be avoided, as reading from a disk is very expensive, and whilst that's the sole purpose of your code now, I would consider how that might change in the future - arbitrarily wasting clock cycles is never a good idea.
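
A minimal sketch of the non-blocking counterpart, assuming the same newline-delimited file from the question (note that this still loads the whole file into memory at once; the chunked approach below avoids that):

var fs = require('fs');

// Async counterpart of readFileSync: the event loop stays free while the file is read.
fs.readFile('./mailinglist.txt', 'utf8', function (err, contents) {
  if (err) throw err;
  var lines = contents.split('\n');
  // hand the lines to whatever processing / throttling logic you use
});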

For large data files I would read them in defined chunks and process them. If you can come up with some scheme, either sentinels to distinguish data blocks within the file or padding to boundaries, then process the file piece by piece.

This is just rough and untested, off the top of my head, but something like:

var fs = require("fs");

function doMyCoolWork(startByteIndex, endByteIndex){

    fs.open("path to your text file", 'r', function(err, fd) {

        var chunkSize = endByteIndex - startByteIndex;
        var buffer = new Buffer(chunkSize);

        // read chunkSize bytes starting at startByteIndex of the file
        fs.read(fd, buffer, 0, chunkSize, startByteIndex, function(err, byteCount) {

            var data = buffer.toString('utf-8', 0, byteCount);

            // process your data here

            fs.close(fd, function () {});

            if(stillWorkToDo){ // e.g. byteCount === chunkSize
                // recurse into the next chunk
                doMyCoolWork(endByteIndex, endByteIndex + chunkSize);
            }
        });
    });
}

Or look into one of the stream library's functions for similar functionality.
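
One concrete option (the readline mechanism the comments below also arrive at) is Node's readline module, which wraps a read stream and emits one line at a time. A rough sketch, assuming the newline-delimited mailing list from the question:

var fs = require('fs');
var readline = require('readline');

var rl = readline.createInterface({
  input: fs.createReadStream('./mailinglist.txt'),
  output: process.stdout,
  terminal: false
});

rl.on('line', function (line) {
  // one record at a time, without holding the whole file in memory
});

rl.on('close', function () {
  // reached the end of the file
});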

H2H

P.S. JavaScript and Node work extremely well with async and eventing. Using sync calls is an antipattern in my opinion, and likely to make the code a headache in future.

Darius
  • I totally get your point about async read and then processing the data as it comes. I was rather worried that a million hanging functions with pending callbacks would cause a huge memory usage bump. – Maciej Jankowski Sep 16 '14 at 15:46
  • My file is `\n` delimited and as far as I know Node does not emit a `new line` event, so my choices for processing are somewhat limited. – Maciej Jankowski Sep 16 '14 at 15:56
  • You could either take an arbitrary block size from the file each time (e.g. i = 100 bytes) and find the index of the nearest '\n' character before the ith byte in that chunk (say 'n'), so that the next read from the file starts at (i - n) etc., or you can pad all rows to be the same number of bytes long, which is similar to what happens in the HTTP protocol for packets – Darius Sep 16 '14 at 16:25
  • aha! got it here: http://stackoverflow.com/a/15554600/678074 - it seems there IS a readline mechanism! although not perfect. – Maciej Jankowski Sep 16 '14 at 16:26
  • It seems the limitations of it suit your case of only needing '\n' fortunately, and behind the scenes it is probably doing some similar buffered iteration over the file stream. Good luck with it :) – Darius Sep 16 '14 at 16:27