
I currently have a nodejs script that reads data from CSV files and then writes to a number of different CSVs based on the data in each row.

There are 300 CSVs (around 40 GB worth) to process, so I added async to my script to read and write the data simultaneously across all cores.

async.mapLimit(filePaths, 4, streamZip, function (err, results) {
    console.log('finished');
});

But it turns out that is not what async does. This code actually takes more time to complete than processing each file individually because it is using just a single core.

There seem to be many different ways to use more cores: cluster, child_process, web workers and worker-farm.
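To be clear about what I have in mind, here is a rough, untested sketch of a child_process.fork pool that hands one file path at a time to each worker. The file names (parent.js, worker.js) and the message shapes are just my own assumptions, not something I have working:

// parent.js -- rough sketch of a process pool using child_process.fork (untested)
// worker.js is a hypothetical companion script, shown underneath
var fork = require('child_process').fork;
var os = require('os');

var filePaths = [/* ... the 300 zipped CSV paths ... */];
var nextIndex = 0;

function spawnWorker() {
    var worker = fork('./worker.js');

    // Hand the worker a file, or disconnect it when nothing is left.
    function assignWork() {
        if (nextIndex < filePaths.length) {
            worker.send({ filePath: filePaths[nextIndex++] });
        } else {
            worker.disconnect();
        }
    }

    // The worker messages back whenever it has finished its current file.
    worker.on('message', assignWork);
    assignWork();
}

// One worker per core as a starting point.
os.cpus().forEach(spawnWorker);

// worker.js -- processes one file at a time and reports back to the parent
process.on('message', function (msg) {
    // streamZip is the function from my script below
    streamZip(msg.filePath, function (err, result) {
        process.send(result || err);
    });
});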

There have also been other questions asked like this one.

But they all seem to want to use HTTP or Express and run as a server, or they call an external program like 'ls', rather than just running a multiprocessing pool like I would in Python.

Can anyone provide an example of, or help with, how to use threads or processes to read multiple CSV files in parallel and have them all write to the same fs.createWriteStream outputs?
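The part I really cannot picture is the writing side: I assume separate processes cannot share a single fs.createWriteStream object, so presumably each worker would either write its own temporary files, or send the parsed rows back to the parent and let the parent do all the writing. A rough, untested sketch of the second idea (the file names, stream names and message format are my own invention):

// In the parent process: one write stream per identifier, owned only by the parent.
var fs = require('fs');
var outputs = {
    '10': fs.createWriteStream('output_10.csv'),
    '11': fs.createWriteStream('output_11.csv'),
    '15': fs.createWriteStream('output_15.csv')
};

// Attach this to every forked worker; each worker sends { identifier, row } per record.
function handleRow(msg) {
    var out = outputs[msg.identifier];
    if (out) {
        out.write(msg.row.join(',') + '\n');
    }
}
// worker.on('message', handleRow);

// And in the worker, instead of writing to the streams directly:
// process.send({ identifier: data[0], row: data });

Obviously the per-row messages and the "finished a file" messages from the earlier sketch would need to be distinguished somehow, and sending every row over IPC might be slow, so maybe having each worker write its own files and concatenating them at the end is simpler.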

Thanks

More of my code is here:

function streamZip(filePath, callback) {
    // Parse each CSV row and route it to the matching output stream by its identifier.
    // csv10 / csv11 / csv15 are fs.createWriteStream outputs created elsewhere in the script.
    var csvStream = csv()
        .on("data", function (data) {
            var identifier = data[0];
            if (identifier === '10') {
                csv10.write(data);
            } else if (identifier === '11') {
                csv11.write(data);
            } else if (identifier === '15') {
                csv15.write(data);
            }
        })
        .on("end", function () {
            callback(null, filePath + 'Processed');
        });

    // Read the zipped file and pipe every entry it contains into the CSV parser.
    fs.createReadStream(filePath)
        .pipe(unzip.Parse())
        .on('entry', function (entry) {
            entry.pipe(csvStream);
        });
}
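For reference, the output streams are just plain fs.createWriteStream handles created once near the top of the script, something like the following (the actual file names are placeholders and do not matter):

var fs = require('fs');
// One output stream per identifier, shared by every streamZip call in this process.
var csv10 = fs.createWriteStream('output_10.csv');
var csv11 = fs.createWriteStream('output_11.csv');
var csv15 = fs.createWriteStream('output_15.csv');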
  • Are you reading from a single disk? If you are, reading multiple files simultaneously from the same disk actually creates more overhead (as the disk heads will have to jump from file to file) and causes it to be slower, making multi-core processing actually worse in this case. – Olivercodes Oct 22 '15 at 19:28
  • @Kinetics yes they are all on the same SSD. That is very interesting, but in Python using a multiprocessing pool it is so much quicker than a single thread reading and writing data on the same disk. I thought Node.js should be able to do the same – tjmgis Oct 23 '15 at 07:32
  • Yeah it would be fine with an SSD. No head moving around :) – Olivercodes Oct 23 '15 at 15:23
  • I have some time later, I'll post a multiprocessor example for you – Olivercodes Oct 23 '15 at 15:24
  • @Kinetics that would be incredibly useful if you get a chance – tjmgis Oct 25 '15 at 08:13

0 Answers