
I want to process a large file line by line with Node.js. The file is 100 MB with 500,000 lines. For reading the input file line by line I found this solution:

javascript - node.js: read a text file into an array. (Each line an item in the array.) - Stack Overflow

Now I need to write each line into a new output file, so I try:

function readLines(input, func)
{
    var remaining = "";

    input.on("data", function(data)
    {
        // Append the new chunk and call func once per complete line
        remaining += data;
        var index = remaining.indexOf("\n");
        var last = 0;
        while (index > -1)
        {
            var line = remaining.substring(last, index);
            last = index + 1;
            func(line);
            index = remaining.indexOf("\n", last);
        }

        // Keep the trailing partial line for the next chunk
        remaining = remaining.substring(last);
    });

    input.on("end", function()
    {
        // Flush whatever is left after the last newline
        if (remaining.length > 0)
        {
            func(remaining);
        }
    });
}

function write(data)
{
    // write() returns a boolean signalling backpressure, which is ignored here
    var written = output.write(data);
}

var fs = require("fs");
var input = fs.createReadStream("input.txt");
var output = fs.createWriteStream("output.txt", {flags: "w"});
readLines(input, write);

However, the script is really slow: it takes over an hour to process the input file completely, with CPU usage around 25% and memory usage climbing to 200 MB. Can anybody tell me if there is any way to optimize it?

Teiv
  • Did you check some of the other answers here? For example http://stackoverflow.com/questions/9486683/writing-large-files-with-node-js?rq=1 – mplungjan Mar 03 '13 at 06:32

1 Answer

The problem you're facing is that you're constantly 1) appending to a string and 2) slicing a string. Both of these operations likely cause a new string to be allocated and the old data to be copied across, which is slow. The old strings are no longer referenced, so they are eventually freed by the garbage collector, but that takes time, hence the large memory usage.
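
To make the copying cost concrete, here is a small, hypothetical illustration of the pattern being described (this snippet is not from the original answer, and the loop count is arbitrary):

var s = "";
for (var i = 0; i < 500000; i++)
{
    // Each += may allocate a new, larger string and copy the old
    // contents into it, so the total work can grow quadratically
    s += "line " + i + "\n";
}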

There are simpler ways to do this of course, but I assume you want to learn how to do it using streams in Node.js. The general technique you can use to replace lots of appends and slices in this sort of situation is to accumulate your data in an array of strings. You can later join an array of strings into a single string with myArray.join(""), which would transform ["hello, ", "world"] into "hello, world". It's much faster to build up an array of strings and then join them all at once into one big string than it is to create the string by appending each piece to the last.
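
For illustration, one way to apply that suggestion to the code in the question might look like this. This is a sketch, not part of the original answer: the batch size of 5000 is an arbitrary choice, and func now receives a joined batch of lines rather than a single line.

function readLines(input, func)
{
    var remaining = "";
    var lines = []; // accumulate complete lines here

    input.on("data", function(data)
    {
        remaining += data;
        var index = remaining.indexOf("\n");
        var last = 0;
        while (index > -1)
        {
            // keep the newline so the output matches the input
            lines.push(remaining.substring(last, index + 1));
            last = index + 1;
            index = remaining.indexOf("\n", last);
        }
        remaining = remaining.substring(last);

        // flush in batches: one join and one write per 5000 lines
        // instead of one write per line
        if (lines.length >= 5000)
        {
            func(lines.join(""));
            lines = [];
        }
    });

    input.on("end", function()
    {
        if (remaining.length > 0)
        {
            lines.push(remaining);
        }
        if (lines.length > 0)
        {
            func(lines.join(""));
        }
    });
}

With the question's setup, write(data) can stay as it is; each call simply receives a few thousand lines joined into one string instead of a single line.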

Hope that helps and is enough for you to solve this problem and still learn something from it!

Thomas Parslow