
I have complex, CPU-intensive work I want to do on a large array. Ideally, I'd like to pass the array to a child process.

var spawn = require('child_process').spawn;

// dataAsNumbers is a large 2D array
var child = spawn(process.execPath, ['/child_process_scripts/getStatistics', dataAsNumbers]);

child.stdout.on('data', function(data){
  console.log('from child: ', data.toString());
});

But when I do, node gives the error:

spawn E2BIG

I came across this article

So piping the data to the child process seems to be the way to go. My code is now:

var spawn = require('child_process').spawn;

console.log('creating child........................');

var options = { stdio: [null, null, null, 'pipe'] };
var args = [ '/getStatistics' ];
var child = spawn(process.execPath, args, options);

var pipe = child.stdio[3];

pipe.write(Buffer('awesome'));

child.stdout.on('data', function(data){
  console.log('from child: ', data.toString());
});

And then in getStatistics.js:

console.log('im inside child');

process.stdin.on('data', function(data) {
  console.log('data is ', data);
  process.exit(0);
});

However, the callback passed to process.stdin.on is never reached. How can I receive a stream in my child script?
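(For reference, a minimal sketch of reading that extra pipe on the child side, assuming the stdio layout above: the fourth slot is file descriptor 3, and the child has to open it explicitly, since process.stdin is fd 0 and never receives this data.)

// getStatistics.js -- sketch: read the extra pipe created by stdio: [null, null, null, 'pipe']
const fs = require('fs');

const pipe = fs.createReadStream(null, { fd: 3 });

pipe.on('data', function (data) {
  console.log('data is ', data);
  process.exit(0);
});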

EDIT

I had to abandon the buffer approach. Now I'm sending the array as a message:

var cp = require('child_process');
var child = cp.fork('/getStatistics.js');

child.send({ 
  dataAsNumbers: dataAsNumbers
});

But this only works when the length of dataAsNumbers is below about 20,000, otherwise it times out.

Mark
  • node is not the right tool for this type of work. I would rather recommend you use a multithreaded language. – arboreal84 May 19 '17 at 10:05
  • The project is 90% complete, I won't be changing from node now. There are plenty of articles explaining heavy CPU usage with node – Mark May 20 '17 at 09:27
  • 2
    Usually it is a good idea to start a project solving the core problems first. In a multithreaded language you would not need to copy data around since threads share memory. Copying data in this case will slow down everything. In addition to that, node is fast when you delegate the work to libuv. If you plan to use the v8 portion of node for heavy processing then it will not be fast. Plus, if for any reason this is a part of an actual server, your event loop will block and the I/O will starve making all your requests time out. – arboreal84 May 20 '17 at 09:31
  • I appreciate that but there are ways around this e.g. http://neilk.net/blog/2013/04/30/why-you-should-use-nodejs-for-CPU-bound-tasks/ – Mark May 20 '17 at 18:23
  • Why don't you send it in chunks @Mark? – d9ngle May 20 '17 at 18:42
  • About how many elements will this array usually have? Also, am I correct in assuming that it contains regular JavaScript `Number`s? – rvighne May 20 '17 at 19:15
  • @rvighne Can be up to 1 million entries in the array, and each element is itself an array with up to 20 entries. The values are all floating-point numbers – Mark May 22 '17 at 18:01
  • @Mark: I just confirmed that [my answer](https://stackoverflow.com/a/44091211/1079573) works on arrays of 20 million 64-bit floats and only takes 300ms (including filling the array). – rvighne May 23 '17 at 05:04
  • @rvighne thanks! trying it now – Mark May 23 '17 at 08:53

7 Answers


With such a massive amount of data, I would look into using shared memory rather than copying the data into the child process (which is what happens when you use a pipe or pass messages). This saves memory, takes less CPU time in the parent process, and is unlikely to bump into any limit.

shm-typed-array is a very simple module that seems suited to your application. Example:

parent.js

"use strict";

const shm = require('shm-typed-array');
const fork = require('child_process').fork;

// Create shared memory
const SIZE = 20000000;
const data = shm.create(SIZE, 'Float64Array');

// Fill with dummy data
Array.prototype.fill.call(data, 1);

// Spawn child, set up communication, and give shared memory
const child = fork("child.js");
child.on('message', sum => {
    console.log(`Got answer: ${sum}`);

    // Demo only; ideally you'd re-use the same child
    child.kill();
});
child.send(data.key);

child.js

"use strict";

const shm = require('shm-typed-array');

process.on('message', key => {
    // Get access to shared memory
    const data = shm.get(key, 'Float64Array');

    // Perform processing
    const sum = Array.prototype.reduce.call(data, (a, b) => a + b, 0);

    // Return processed data
    process.send(sum);
});

Note that we are only sending a small "key" from the parent to the child process through IPC, not the whole data. Thus, we save a ton of memory and time.

Of course, you can change 'Float64Array' (which corresponds to a C double) to whatever typed array your application requires. Note that this library in particular only handles single-dimensional typed arrays, but that should only be a minor obstacle.
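For example, a rough sketch of flattening the 2D array into the one-dimensional shared buffer; the rows/cols bookkeeping here is illustrative, not part of the library:

// Flatten a 2D array (rows x cols) into the 1D shared Float64Array.
// Assumes every row has the same length.
const rows = dataAsNumbers.length;
const cols = dataAsNumbers[0].length;
const flat = shm.create(rows * cols, 'Float64Array');

for (let i = 0; i < rows; i++) {
  for (let j = 0; j < cols; j++) {
    flat[i * cols + j] = dataAsNumbers[i][j];
  }
}

// The child can then read element (i, j) as flat[i * cols + j],
// as long as you also send it cols (e.g. alongside flat.key).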

rvighne

I too was able to reproduce the delay you were experiencing, though maybe not as badly as you. I used the following:

// main.js
const fork = require('child_process').fork

const child = fork('./getStats.js')

const dataAsNumbers = Array(100000).fill(0).map(() =>
  Array(100).fill(0).map(() => Math.round(Math.random() * 100)))

child.send({
  dataAsNumbers: dataAsNumbers,
})

And

// getStats.js
process.on('message', function (data) {
  console.log('data is ', data)
  process.exit(0)
})

node main.js 2.72s user 0.45s system 103% cpu 3.045 total

I'm generating 100k elements composed of 100 numbers each to mock your data; make sure you are listening for the message event on process in the child. Maybe your child script does more work than this and that is the reason for the failure; it also depends on the timeout you set on your query.


If you want to get better results, what you could do is chunk your data into multiple pieces that will be sent to the child process and reconstructed to form the initial array.
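For example, a rough sketch of chunking; the chunk size and the { type: 'chunk' } / { type: 'done' } message shapes are just illustrative:

// main.js -- send the array in slices instead of one big message
const chunkSize = 10000; // tune to your data
for (let i = 0; i < dataAsNumbers.length; i += chunkSize) {
  child.send({ type: 'chunk', rows: dataAsNumbers.slice(i, i + chunkSize) });
}
child.send({ type: 'done' });

// getStats.js -- rebuild the array before starting the heavy work
let received = [];
process.on('message', msg => {
  if (msg.type === 'chunk') {
    received = received.concat(msg.rows);
  } else if (msg.type === 'done') {
    // full array reconstructed here; run the statistics, then exit
    process.exit(0);
  }
});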


Also, one possibility would be to use a third-party library or protocol, even if it's a bit more work. You could have a look at messenger.js or even something like an AMQP queue, which would allow you to communicate between the two processes with a pool and a guarantee that messages are acknowledged by the sub-process. There are a few Node implementations, like amqp.node, but it would still require a bit of setup and configuration work.
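As a rough illustration of the AMQP route with amqp.node (amqplib) — the 'stats' queue name, the local broker URL, and the message shape are assumptions, not part of the question:

// Sketch of the producer and consumer sides using amqplib (amqp.node)
const amqp = require('amqplib');

// Parent: publish the data to a queue
async function publish(dataAsNumbers) {
  const conn = await amqp.connect('amqp://localhost'); // assumed local broker
  const ch = await conn.createChannel();
  await ch.assertQueue('stats', { durable: false });
  ch.sendToQueue('stats', Buffer.from(JSON.stringify(dataAsNumbers)));
  await ch.close();
  await conn.close();
}

// Worker: consume and acknowledge each message once it has been processed
async function consume() {
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();
  await ch.assertQueue('stats', { durable: false });
  ch.consume('stats', msg => {
    const dataAsNumbers = JSON.parse(msg.content.toString());
    // ... do the statistics work here ...
    ch.ack(msg);
  });
}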

Preview
  • Thanks, I had a different problem, but your answer helped me to fix my issue. [Running karma tests in teamcity](https://stackoverflow.com/questions/50257826/concurrently-node-exits-with-status-1-this-halts-teamcity-leading-it-to-believe) – Adrian Moisa May 11 '18 at 13:03

Use an in-memory cache like https://github.com/ptarjan/node-cache, and let the parent process store the array contents under some key; the child process would then retrieve the contents through that key.

DhruvPathak

You could consider using OS pipes (you'll find a gist here) as an input to your Node child application.

I know this is not exactly what you're asking for, but you could use the cluster module (included in Node). This way you can get as many instances as your machine has cores to speed up processing. Moreover, consider using streams if you don't need to have all the data available before you start processing. If the data to be processed is too large, I would store it in a file so you can reinitialize if there is any error during the process. Here is an example of clustering:

var cluster = require('cluster');
var numCPUs = 4;

if (cluster.isMaster) {
    // Master: fork one worker per CPU
    for (var i = 0; i < numCPUs; i++) {
        var worker = cluster.fork();
        console.log('id', worker.id);
    }
} else {
    // Worker: each forked process runs this branch
    doSomeWork();
}

function doSomeWork() {
    for (var i = 1; i < 10; i++) {
        console.log(i);
    }
}

More info on sending messages across workers in question 8534462.
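A minimal sketch of master/worker messaging with cluster (the message shapes here are illustrative):

var cluster = require('cluster');

if (cluster.isMaster) {
    var worker = cluster.fork();

    // Master: receive results back from the worker
    worker.on('message', function (msg) {
        console.log('result from worker:', msg.result);
    });

    // Master: send a chunk of work to the worker
    worker.send({ rows: [[1, 2], [3, 4]] });
} else {
    // Worker: process each chunk and report back
    process.on('message', function (msg) {
        var sum = 0;
        msg.rows.forEach(function (row) {
            row.forEach(function (n) { sum += n; });
        });
        process.send({ result: sum });
    });
}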

JuanGG

Why do you want to make a subprocess? Sending the data across to a subprocess is likely to cost more in CPU and real time than you would save by moving the processing out of the main process.

Instead, for really efficient code, I would suggest doing your statistics calculations in a worker thread that runs within the same memory as the Node.js main process.

You can use NAN to write C++ code that you can post to a worker thread, and then have that worker thread post the result and an event back to your Node.js event loop when done.

The benefit of this is that you don't spend extra time sending the data across to a different process; the downside is that you will have to write a bit of C++ code for the threaded action, though NAN should take care of most of the difficult work for you.
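On the JavaScript side, calling into such an addon could look roughly like this; the module path and computeStatistics function are hypothetical placeholders for whatever your compiled code exports:

// Hypothetical usage of a NAN-based addon; './build/Release/stats' and
// computeStatistics() stand in for your own compiled module.
const stats = require('./build/Release/stats');

stats.computeStatistics(dataAsNumbers, function (err, result) {
  if (err) throw err;
  // result was computed on a worker thread, so the event loop stayed free
  console.log('statistics:', result);
});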

Soren

To address the performance issue when passing large data to the child process, save the data to a .json or .txt file and pass only the filename to the child process. I've achieved a 70% performance improvement with this approach.
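A minimal sketch of that approach (the temp-file path is only an example):

// parent.js -- write the data to disk and hand the child only the path
const fs = require('fs');
const fork = require('child_process').fork;

const file = '/tmp/dataAsNumbers.json'; // example path
fs.writeFileSync(file, JSON.stringify(dataAsNumbers));

const child = fork('./getStatistics.js', [file]);
child.on('message', result => console.log('result:', result));

// getStatistics.js -- read the filename from argv and load the data
const fs = require('fs');
const dataAsNumbers = JSON.parse(fs.readFileSync(process.argv[2], 'utf8'));
// ... compute the statistics, then report back:
process.send({ done: true });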

  • This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/32676735) – kavigun Sep 15 '22 at 08:21

For long-running tasks you could use something like Gearman. You do the heavy work in worker processes, and you can set up as many workers as you need. For example, I do some file processing this way: if I need to scale, I create more worker instances, and I have different workers for different tasks (processing zip files, generating thumbnails, etc.). The nice thing is that the workers can be written in any language (Node.js, Java, Python) and can be integrated into your project with ease.

// worker-unzip.js
const debug = require('debug')('worker:unzip');
const {series, apply} = require('async');
const gearman = require('gearmanode');
const {mkdirpSync} = require('fs-extra');
const extract = require('extract-zip');

module.exports.unzip = unzip;
module.exports.worker = worker;

function unzip(inputPath, outputDirPath, done) {
  debug('unzipping', inputPath, 'to', outputDirPath);
  mkdirpSync(outputDirPath);
  extract(inputPath, {dir: outputDirPath}, done);
}


/**
 *
 * @param {Job} job
 */
function workerUnzip(job) {
  const {inputPath, outputDirPath} = JSON.parse(job.payload);
  series([
    apply(unzip, inputPath, outputDirPath),
    (done) => job.workComplete(outputDirPath)
  ], (err) => {
    if (err) {
      console.error(err);
      job.reportError();
    }
  });
}

function worker(config) {
  const worker = gearman.worker(config);
  if (config.id) {
    worker.setWorkerId(config.id);
  }

  worker.addFunction('unzip', workerUnzip, {timeout: 10, toStringEncoding: 'ascii'});
  worker.on('error', (err) => console.error(err));

  return worker;
}

A simple index.js:

const worker = require('./worker-unzip').worker;

worker(config); // pass the host and port of the Gearman server

I normally run the workers with PM2.

Integrating it with your code is very easy, something like:

//initialize
const gearman = require('gearmanode');

gearman.Client.logger.transports.console.level = 'error';
const client = gearman.client(configGearman); // same host and port

Then just add work to the queue, passing the name of the function:

const taskpayload = {inputPath: '/tmp/sample-file.zip', outputDirPath: '/tmp/unzip/sample-file/'};
const job = client.submitJob('unzip', JSON.stringify(taskpayload));
job.on('complete', jobCompleteCallback);
job.on('error', jobErrorCallback);
rkmax