149

I need to do some parsing of large (5-10 GB) log files in JavaScript/Node.js (I'm using Cube).

The logline looks something like:

10:00:43.343423 I'm a friendly log message. There are 5 cats, and 7 dogs. We are in state "SUCCESS".

We need to read each line, do some parsing (e.g. strip out 5, 7 and SUCCESS), then pump this data into Cube (https://github.com/square/cube) using their JS client.

Firstly, what is the canonical way in Node to read in a file, line by line?

It seems to be a fairly common question online, and a lot of the answers seem to point to a bunch of third-party modules.

However, this seems like a fairly basic task - surely there's a simple way within the stdlib to read in a text file, line by line?

Secondly, I then need to process each line (e.g. convert the timestamp into a Date object, and extract useful fields).

What's the best way to do this, maximising throughput? Is there some way that won't block on either reading in each line, or on sending it to Cube?

Thirdly - I'm guessing that using string splits and the JS equivalent of contains (`indexOf() != -1`?) will be a lot faster than regexes? Has anybody had much experience parsing massive amounts of text data in Node.js?
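
For reference, this is roughly the kind of per-line extraction I have in mind, done with plain string methods (just a sketch, not benchmarked against a regex):

// rough sketch of the parsing I'm after, for a line like the sample above
function parseLine(line) {
    var firstSpace = line.indexOf(' ');
    var timestamp = line.slice(0, firstSpace);   // "10:00:43.343423" - still needs a date part before new Date() makes sense
    var rest = line.slice(firstSpace + 1);

    var cats = parseInt(rest.slice(rest.indexOf('There are ') + 'There are '.length), 10);  // 5
    var dogs = parseInt(rest.slice(rest.indexOf('and ') + 'and '.length), 10);              // 7
    var state = rest.slice(rest.indexOf('"') + 1, rest.lastIndexOf('"'));                   // "SUCCESS"

    return { timestamp: timestamp, cats: cats, dogs: dogs, state: state };
}

console.log(parseLine('10:00:43.343423 I\'m a friendly log message. There are 5 cats, and 7 dogs. We are in state "SUCCESS".'));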

starball
  • 20,030
  • 7
  • 43
  • 238
victorhooi
  • 16,775
  • 22
  • 90
  • 113
  • I built a log parser in node that takes a bunch of regex strings with 'captures' built in and outputs to JSON. You can even call functions on each capture if you want to do a calc. It might do what you want: **https://npmjs.org/package/logax** – Jess Jan 17 '14 at 17:04
  • A better comparison https://betterprogramming.pub/a-memory-friendly-way-of-reading-files-in-node-js-a45ad0cc7bb6 – yashodha_h Nov 14 '21 at 13:09

13 Answers

248

I searched for a solution to parse very large files (GBs) line by line using a stream. None of the third-party libraries and examples suited my needs, since they either did not process the file line by line or read the entire file into memory.

The following solution can parse very large files, line by line, using stream & pipe. For testing I used a 2.1 GB file with 17,000,000 records. RAM usage did not exceed 60 MB.

First, install the event-stream package:

npm install event-stream

Then:

var fs = require('fs')
    , es = require('event-stream');

var lineNr = 0;

var s = fs.createReadStream('very-large-file.csv')
    .pipe(es.split())
    .pipe(es.mapSync(function(line){

        // pause the readstream
        s.pause();

        lineNr += 1;

        // process line here and call s.resume() when rdy
        // function below was for logging memory usage
        logMemoryUsage(lineNr);

        // resume the readstream, possibly from a callback
        s.resume();
    })
    .on('error', function(err){
        console.log('Error while reading file.', err);
    })
    .on('end', function(){
        console.log('Read entire file.')
    })
);
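
If the work you do per line is itself asynchronous (a DB insert, the Cube HTTP call, etc.), the idea is to call s.resume() from that operation's callback instead of right away. A minimal sketch of that variation - insertIntoCube below is just a placeholder for whatever async call you make:

var fs = require('fs')
    , es = require('event-stream');

// placeholder for whatever async operation you run per line
function insertIntoCube(line, done) {
    setImmediate(done);
}

var s = fs.createReadStream('very-large-file.csv')
    .pipe(es.split())
    .pipe(es.mapSync(function(line){

        // stop reading while this line is being handled
        s.pause();

        insertIntoCube(line, function(err){
            if (err) console.log('Insert failed.', err);
            // only continue once the async work is done
            s.resume();
        });
    })
    .on('end', function(){
        console.log('Read entire file.');
    })
);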


Please let me know how it goes!

jameshfisher
  • 34,029
  • 31
  • 121
  • 167
Gerard
  • 3,108
  • 1
  • 19
  • 21
  • 8
    FYI, this code isn't synchronous. It's asynchronous. If you insert `console.log(lineNr)` after the last line of your code, it will not show the final line count because the file is read asynchronously. – jfriend00 Jun 16 '15 at 23:19
  • What is logMemoryUsage(lineNr)? Or how do I get the data for lineNr in the file? If I try to print lineNr with my file I get only the number 2, and there are 6000 records – Labeo Oct 07 '15 at 17:32
  • Hi Labeo, it was a function I used to log memory usage but it's not included in the code. You can remove that line from the code. – Gerard Oct 12 '15 at 12:24
  • 6
    Thank you, this was the only solution I could find that actually paused and resumed when it was supposed to. Readline didn't. – Brent Oct 14 '15 at 18:51
  • 3
    Awesome example, and it does actually pause. Additionally if you decide to stop the file read early you can use `s.end();` – zipzit Feb 23 '16 at 18:16
  • 1
    What's the benefit of have pause and resume? I don't think my typical uses would need that unless I'm missing something. – hippietrail Feb 24 '16 at 09:14
  • I was wondering what's the point of having resume inside an anonymous function? – ambodi Mar 01 '16 at 09:28
  • @ambodi it served no purpose, I removed the function. – Gerard Mar 09 '16 at 13:14
  • 1
    @ambodi, I believe the purpose of pause/resume is to allow other asynchronous processes to happen before continuing to read from the file. – jchook May 23 '16 at 14:48
  • 2
    Worked like a charm. Used it to index 150 million documents to elasticsearch index. `readline` module is a pain. It does not pause and was causing failure everytime after 40-50 million. Wasted a day. Thanks a lot for the answer. This one works perfectly – Mandeep Singh Jun 07 '16 at 17:42
  • I use this for parsing log files, and for lines that I skip it can get ahead of itself and result in a stackoverflow. If there isn't work for every line be wary of this, because it might end up failing. – David Oct 04 '16 at 21:27
  • In case you run out of heap memory you can use below command format node --max_old_space_size=4096 app.js – Prabhat Nov 16 '16 at 02:55
  • 1
    The `pause` and `resume` would only be useful if `resume` was called from a callback, or if the containing function was `async` and `await`s were used between the `pause` and `resume` methods –  Apr 07 '17 at 12:29
  • How I'll be writing to another stream. My requirement is to read heavy JSON data from db in GB's and write it to excel file. If using your code then i need to push data into writable excel stream. please guide @Gerard – Amulya Kashyap Sep 25 '17 at 17:38
  • 3
    event-stream was compromised: https://medium.com/intrinsic/compromised-npm-package-event-stream-d47d08605502 but 4+ is apparently safe https://blog.npmjs.org/post/180565383195/details-about-the-event-stream-incident – John Vandivier Sep 02 '19 at 00:25
  • Not one of `s.end(); s.destroy(); s.emit('end');` will stop the stream from reading. It takes 25 sec to read 1 line of a 5 GB file from an SSD drive (because the stream reader does not stop). See the line-by-line answer for a solution that did this in 1 ms (using `s.cancel()`). – Perez Lamed van Niekerk Jan 14 '20 at 11:18
  • You are a god, please name my daughter. – user2081518 Jun 14 '20 at 13:06
88

You can use the built-in `readline` module, see the docs here. I use `stream` to create a new output stream.

    var fs = require('fs'),
        readline = require('readline'),
        stream = require('stream');
    
    var instream = fs.createReadStream('/path/to/file');
    var outstream = new stream;
    outstream.readable = true;
    outstream.writable = true;
    
    var rl = readline.createInterface({
        input: instream,
        output: outstream,
        terminal: false
    });
    
    rl.on('line', function(line) {
        console.log(line);
        //Do your stuff ...
        //Then write to output stream
        rl.write(line);
    });

Large files will take some time to process. Do tell if it works.
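
On newer Node versions (11.4+) the readline interface is also async-iterable, so you can await asynchronous work per line without juggling pause/resume yourself. A rough sketch (doSomethingAsync is just a placeholder for your own processing):

    const fs = require('fs'),
        readline = require('readline');

    // placeholder for your own asynchronous per-line processing
    function doSomethingAsync(line) {
        return new Promise(resolve => setImmediate(resolve));
    }

    (async function () {
        const rl = readline.createInterface({
            input: fs.createReadStream('/path/to/file'),
            crlfDelay: Infinity
        });

        for await (const line of rl) {
            // the next line isn't pulled until this body has finished
            await doSomethingAsync(line);
        }

        console.log('Done reading file.');
    })();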

Lee Goddard
  • 10,680
  • 4
  • 46
  • 63
user568109
  • 47,225
  • 17
  • 99
  • 123
  • 2
    As written, the second to last line fails because cubestuff is not defined. – Greg Dec 17 '14 at 17:57
  • 2
    Using `readline`, is it possible to pause/resume the read stream to perform async actions in the "do stuff" area? – jchook May 23 '16 at 15:01
  • 3
    @jchook `readline` was giving me a lot of problems when I tried pause/resume. It does not pause the stream properly creating a lot of problem if the downstream process is slower – Mandeep Singh Jun 07 '16 at 17:44
35

I really liked @gerard's answer, which actually deserves to be the accepted answer here. I made some improvements:

  • Code is in a class (modular)
  • Parsing is included
  • The ability to resume is exposed to the outside, in case an asynchronous job is chained to reading the CSV, like inserting into a DB or an HTTP request (there's a usage sketch after the code below)
  • Reading in chunks/batch sizes that the user can declare. I took care of the encoding in the stream too, in case you have files in a different encoding.

Here's the code:

'use strict'

const fs = require('fs'),
    util = require('util'),
    stream = require('stream'),
    es = require('event-stream'),
    parse = require("csv-parse"),
    iconv = require('iconv-lite');

class CSVReader {
  constructor(filename, batchSize, columns) {
    this.reader = fs.createReadStream(filename).pipe(iconv.decodeStream('utf8'))
    this.batchSize = batchSize || 1000
    this.lineNumber = 0
    this.data = []
    this.parseOptions = {delimiter: '\t', columns: true, escape: '/', relax: true}
  }

  read(callback) {
    this.reader
      .pipe(es.split())
      .pipe(es.mapSync(line => {
        ++this.lineNumber

        parse(line, this.parseOptions, (err, d) => {
          this.data.push(d[0])
        })

        if (this.lineNumber % this.batchSize === 0) {
          this.reader.pause() // wait for the caller to process this batch and call continue()
          callback(this.data)
        }
      })
      .on('error', function(){
          console.log('Error while reading file.')
      })
      .on('end', function(){
          console.log('Read entire file.')
      }))
  }

  continue () {
    this.data = []
    this.reader.resume()
  }
}

module.exports = CSVReader

So basically, here is how you will use it:

let reader = new CSVReader('path_to_file.csv')
reader.read(() => reader.continue())

I tested this with a 35 GB CSV file and it worked for me, which is why I chose to build it on @gerard's answer. Feedback is welcome.
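
For example, if each batch is handed to some asynchronous job before reading continues, you would call continue() from that job's callback. A rough sketch (insertBatchIntoDb and the './csv-reader' path are placeholders):

const CSVReader = require('./csv-reader')  // wherever you saved the class above

// placeholder for your own asynchronous batch operation (e.g. a DB insert)
function insertBatchIntoDb(rows, done) {
  setImmediate(done)
}

let reader = new CSVReader('path_to_file.csv', 1000)
reader.read(batch => {
  insertBatchIntoDb(batch, err => {
    if (err) console.log('Insert failed.', err)
    reader.continue()  // resume reading once the batch has been stored
  })
})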

Community
  • 1
  • 1
ambodi
  • 6,116
  • 2
  • 32
  • 22
25

I used https://www.npmjs.com/package/line-by-line for reading more than 1,000,000 lines from a text file. In this case, RAM usage was about 50-60 MB.

    const LineByLineReader = require('line-by-line'),
    lr = new LineByLineReader('big_file.txt');

    lr.on('error', function (err) {
         // 'err' contains error object
    });

    lr.on('line', function (line) {
        // pause emitting of lines...
        lr.pause();

        // ...do your asynchronous line processing..
        setTimeout(function () {
            // ...and continue emitting lines.
            lr.resume();
        }, 100);
    });

    lr.on('end', function () {
         // All lines are read, file is closed now.
    });
Eugene Ilyushin
  • 602
  • 9
  • 14
  • 1
    'line-by-line' is more memory efficient than the selected answer. For 1 million lines in a csv the selected answer had my node process in the low 800s of megabytes. Using 'line-by-line' it was consistently in the low 700s. This module also keeps the code clean and easy to read. In total I will need to read about 18 million so every mb counts! – Neo Aug 27 '17 at 01:26
  • 1
    it's a shame this uses it's own event 'line' instead of the standard 'chunk', meaning you won't be able to make use of 'pipe'. – Rene Wooller Sep 21 '17 at 23:02
  • After hours of testing and searching this is the only solution that actually stop on ```lr.cancel()``` method. Reads first 1000 lines of a 5Gig file in 1ms. Awesome!!!! – Perez Lamed van Niekerk Jan 14 '20 at 11:07
18

The Node.js Documentation offers a very elegant example using the Readline module.

Example: Read File Stream Line-by-Line

const { once } = require('node:events');
const fs = require('fs');
const readline = require('readline');

// wrapped in an async function so that `await` is allowed in CommonJS
(async function processLineByLine() {
    const rl = readline.createInterface({
        input: fs.createReadStream('sample.txt'),
        crlfDelay: Infinity
    });

    rl.on('line', (line) => {
        console.log(`Line from file: ${line}`);
    });

    await once(rl, 'close');
})();

Note: we use the crlfDelay option to recognize all instances of CR LF ('\r\n') as a single line break.

Lee Goddard
  • 10,680
  • 4
  • 46
  • 63
Jaime Gómez
  • 6,961
  • 3
  • 40
  • 41
  • In my case, I want to show the entire text in an HTML using an element's `innerHTML`, but the last line is always cut off, even if I have `overflow: auto` in my css. What's wrong? – kakyo Sep 11 '20 at 04:01
  • OK, I got it. I got to use a bigger `padding-bottom` than my `padding` parameter. – kakyo Sep 11 '20 at 04:04
  • Can you explain the purpose of using 'readline'. Why can't we just do it using readStream? – Apoorva Ambhoj Sep 12 '22 at 16:22
8

Apart from reading the big file line by line, you can also read it chunk by chunk. For more, refer to this article.

var fs = require('fs');

var offset = 0;
var chunkSize = 2048;
var chunkBuffer = Buffer.alloc(chunkSize);
var fp = fs.openSync('filepath', 'r');
var bytesRead = 0;
var lines = [];
while(bytesRead = fs.readSync(fp, chunkBuffer, 0, chunkSize, offset)) {
    offset += bytesRead;
    var str = chunkBuffer.slice(0, bytesRead).toString();
    var arr = str.split('\n');

    if(bytesRead === chunkSize) {
        // the last item of the arr may not be a full line, leave it to the next chunk
        offset -= arr.pop().length;
    }
    lines = lines.concat(arr);
}
console.log(lines);
LF00
  • 27,015
  • 29
  • 156
  • 295
  • 3
    Could it be, that the following should be a comparison instead of an assignment: `if(bytesRead = chunkSize)`? – Stefan Rein Jan 14 '20 at 09:17
4

I had the same problem too. After comparing several modules that seem to have this feature, I decided to do it myself; it's simpler than I thought.

gist: https://gist.github.com/deemstone/8279565

var fetchBlock = lineByline(filepath, onEnd);
fetchBlock(function(lines, start){ ... });  // lines {array}, start {int}: the line number of lines[0]

It keeps the opened file in a closure; the fetchBlock() that is returned will fetch a block from the file and split it into an array of lines (it also deals with the partial segment left over from the previous fetch).

I've set the block size to 1024 for each read operation. This may have bugs, but the code logic is obvious, so try it yourself.

Wladimir Palant
  • 56,865
  • 12
  • 98
  • 126
deemstone
  • 170
  • 1
  • 6
4

Reading/writing files using streams with the native Node.js modules (fs, readline):

const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({
                                       input:  fs.createReadStream('input.json'),
                                       output: fs.createWriteStream('output.json')
                                    });

rl.on('line', function(line) {
    console.log(line);

    // Do any 'line' processing if you want and then write to the output file
    this.output.write(`${line}\n`);
});

rl.on('close', function() {
    console.log(`Created "${this.output.path}"`);
});
SridharKritha
  • 8,481
  • 2
  • 52
  • 43
3

Based on this question's answers I implemented a class you can use to read a file synchronously line by line with fs.readSync(). You can make it "pause" and "resume" by using a Q promise (jQuery seems to require a DOM, so it can't run with Node.js):

var fs = require('fs');
var Q = require('q');

var lr = new LineReader(filenameToLoad);
lr.open();

var promise;
function workOnLine() {
    var line = lr.readNextLine();
    promise = complexLineTransformation(line).then(
        function() {console.log('ok');workOnLine();},
        function() {console.log('error');}
    );
}
workOnLine();

function complexLineTransformation(line) {
    var deferred = Q.defer();
    // ... async call goes here, in callback: deferred.resolve('done ok'); or deferred.reject(new Error(error));
    return deferred.promise;
}

function LineReader (filename) {      
  this.moreLinesAvailable = true;
  this.fd = undefined;
  this.bufferSize = 1024*1024;
  this.buffer = Buffer.alloc(this.bufferSize);
  this.leftOver = '';

  this.read = undefined;
  this.idxStart = undefined;
  this.idx = undefined;

  this.lineNumber = 0;

  this._bundleOfLines = [];

  this.open = function() {
    this.fd = fs.openSync(filename, 'r');
  };

  this.readNextLine = function () {
    if (this._bundleOfLines.length === 0) {
      this._readNextBundleOfLines();
    }
    this.lineNumber++;
    var lineToReturn = this._bundleOfLines[0];
    this._bundleOfLines.splice(0, 1); // remove first element (pos, howmany)
    return lineToReturn;
  };

  this.getLineNumber = function() {
    return this.lineNumber;
  };

  this._readNextBundleOfLines = function() {
    var line = "";
    while ((this.read = fs.readSync(this.fd, this.buffer, 0, this.bufferSize, null)) !== 0) { // read next bytes until end of file
      this.leftOver += this.buffer.toString('utf8', 0, this.read); // append to leftOver
      this.idxStart = 0
      while ((this.idx = this.leftOver.indexOf("\n", this.idxStart)) !== -1) { // as long as there is a newline-char in leftOver
        line = this.leftOver.substring(this.idxStart, this.idx);
        this._bundleOfLines.push(line);        
        this.idxStart = this.idx + 1;
      }
      this.leftOver = this.leftOver.substring(this.idxStart);
      if (line !== "") {
        break;
      }
    }
  }; 
}
Balthazar Rouberol
  • 6,822
  • 2
  • 35
  • 41
Benvorth
  • 7,416
  • 8
  • 49
  • 70
2

node-byline uses streams, so I would prefer that one for your huge files.

For your date conversions I would use moment.js.

For maximising your throughput you could think about using a software cluster. There are some nice modules which wrap the node-native cluster module quite well. I like cluster-master from isaacs; e.g. you could create a cluster of x workers which all compute a file.

For benchmarking splits vs. regexes use benchmark.js. I haven't tested it myself yet. benchmark.js is available as a node module.
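
For example, a quick benchmark.js suite comparing plain string methods against a regex on the sample line from the question could look roughly like this (an untested sketch):

var Benchmark = require('benchmark');

var line = '10:00:43.343423 I\'m a friendly log message. There are 5 cats, and 7 dogs. We are in state "SUCCESS".';

new Benchmark.Suite()
    .add('split/indexOf', function () {
        var timestamp = line.slice(0, line.indexOf(' '));
        var success = line.indexOf('"SUCCESS"') !== -1;
    })
    .add('regex', function () {
        var m = /^(\S+).* (\d+) cats.* (\d+) dogs.*"(\w+)"/.exec(line);
    })
    .on('cycle', function (event) {
        console.log(String(event.target));
    })
    .on('complete', function () {
        console.log('Fastest is ' + this.filter('fastest').map('name'));
    })
    .run();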

hereandnow78
  • 14,094
  • 8
  • 42
  • 48
  • 1
    Note `moment.js` nowadays has fallen out of favor due to significant performance concerns, namely: its gargantuan footprint, inability to tree shake, and deeply entrenched but now widely disliked mutability. Even [its own devs](https://momentjs.com/docs/) have all but written it off. Some good alternatives are `date-fns` and `day.js`; here's an article with more details: – Ezekiel Victor Feb 13 '22 at 01:25
1
import * as csv from 'fast-csv';
import * as fs from 'fs';
interface Row {
  [s: string]: string;
}
type RowCallBack = (data: Row, index: number) => object;
export class CSVReader {
  protected file: string;
  protected csvOptions = {
    delimiter: ',',
    headers: true,
    ignoreEmpty: true,
    trim: true
  };
  constructor(file: string, csvOptions = {}) {
    if (!fs.existsSync(file)) {
      throw new Error(`File ${file} not found.`);
    }
    this.file = file;
    this.csvOptions = Object.assign({}, this.csvOptions, csvOptions);
  }
  public read(callback: RowCallBack): Promise < Array < object >> {
    return new Promise < Array < object >> (resolve => {
      const readStream = fs.createReadStream(this.file);
      const results: Array < any > = [];
      let index = 0;
      const csvStream = csv.parse(this.csvOptions).on('data', async (data: Row) => {
        index++;
        results.push(await callback(data, index));
      }).on('error', (err: Error) => {
        console.error(err.message);
        throw err;
      }).on('end', () => {
        resolve(results);
      });
      readStream.pipe(csvStream);
    });
  }
}
import { CSVReader } from '../src/helpers/CSVReader';
(async () => {
  const reader = new CSVReader('./database/migrations/csv/users.csv');
  const users = await reader.read(async data => {
    return {
      username: data.username,
      name: data.name,
      email: data.email,
      cellPhone: data.cell_phone,
      homePhone: data.home_phone,
      roleId: data.role_id,
      description: data.description,
      state: data.state,
    };
  });
  console.log(users);
})();
Raza
  • 3,147
  • 2
  • 31
  • 35
0

Inspired by @gerard's answer, I want to provide a controlled way of reading chunk by chunk.

I have an Electron app which reads multiple large log files chunk by chunk on the user's request; the next chunk is only requested when the user asks for it.

Here is my LogReader class:

// A singleton class, used to read log chunk by chunk
import * as fs from 'fs';
import { logDirPath } from './mainConfig';
import * as path from 'path';

type ICallback = (data: string) => Promise<void> | void;

export default class LogReader {
  filenames: string[];
  readstreams: fs.ReadStream[];
  chunkSize: number;
  lineNumber: number;
  data: string;

  static instance: LogReader;

  private constructor(chunkSize = 10240) {
    this.chunkSize = chunkSize || 10240; // default to 10kB per chunk
    this.filenames = [];
    // collect all log files and sort from latest to oldest
    fs.readdirSync(logDirPath).forEach((file) => {
      if (file.endsWith('.log')) {
        this.filenames.push(path.join(logDirPath, file));
      }
    });

    this.filenames = this.filenames.sort().reverse();
    this.lineNumber = 0;
  }

  static getInstance() {
    if (!this.instance) {
      this.instance = new LogReader();
    }

    return this.instance;
  }

  // read a chunk from a log file
  read(fileIndex: number, chunkIndex: number, cb: ICallback) {
    // file index out of range, return "end of all files"
    if (fileIndex >= this.filenames.length) {
      cb('EOAF');
      return;
    }

    const chunkSize = this.chunkSize;
    fs.createReadStream(this.filenames[fileIndex], {
      highWaterMark: chunkSize, // read chunkSize bytes per chunk (10 kB by default)
      start: chunkIndex * chunkSize, // start byte of this chunk
      end: (chunkIndex + 1) * chunkSize - 1, // end byte of this chunk (end index was included, so minus 1)
    })
      .on('data', (data) => {
        cb(data.toString());
      })
      .on('error', (e) => {
        console.error('Error while reading file.');
        console.error(e);
        cb('EOF');
      })
      .on('end', () => {
        console.log('Read entire chunk.');
        cb('EOF');
      });
  }
}

Then to read chunk by chunk, the main process just needs to call:

  const readLogChunk = (fileIndex: number, chunkIndex: number): Promise<string> => {
    console.log(`=== load log chunk ${fileIndex}: ${chunkIndex}====`);
    return new Promise((resolve) => {
      LogReader.getInstance().read(fileIndex, chunkIndex, (data) => resolve(data));
    });
  };

Keep incrementing chunkIndex to read chunk by chunk.

When EOF is returned, it means one file is finished; just increment the fileIndex.

When EOAF is returned, it means all files have been read; just stop.
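
Putting those rules together, a driver loop on the consuming side could look roughly like this (just a sketch built on the readLogChunk helper above):

(async () => {
  let fileIndex = 0;
  let chunkIndex = 0;
  while (true) {
    const data = await readLogChunk(fileIndex, chunkIndex);
    if (data === 'EOAF') break;   // all files have been read, stop
    if (data === 'EOF') {         // current file finished, move on to the next one
      fileIndex += 1;
      chunkIndex = 0;
      continue;
    }
    // process the chunk here (e.g. append it to the UI)
    chunkIndex += 1;
  }
})();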

Alba Hoo
  • 154
  • 8
-1

I have made a node module to read large files asynchronously (text or JSON). Tested on large files.

var fs = require('fs')
, util = require('util')
, stream = require('stream')
, es = require('event-stream');

module.exports = FileReader;

function FileReader(){

}

FileReader.prototype.read = function(pathToFile, callback){
    var returnTxt = '';
    var s = fs.createReadStream(pathToFile)
    .pipe(es.split())
    .pipe(es.mapSync(function(line){

        // pause the readstream
        s.pause();

        //console.log('reading line: '+line);
        returnTxt += line;        

        // resume the readstream, possibly from a callback
        s.resume();
    })
    .on('error', function(){
        console.log('Error while reading file.');
    })
    .on('end', function(){
        console.log('Read entire file.');
        callback(returnTxt);
    })
);
};

FileReader.prototype.readJSON = function(pathToFile, callback){
    // parse inside the read callback, otherwise the try/catch would never see the async error
    this.read(pathToFile, function(txt){
        try{
            callback(JSON.parse(txt));
        }
        catch(err){
            throw new Error('json file is not valid! '+err.stack);
        }
    });
};

Just save the file as file-reader.js, and use it like this:

var FileReader = require('./file-reader');
var fileReader = new FileReader();
fileReader.readJSON(__dirname + '/largeFile.json', function(jsonObj){/*callback logic here*/});
Eyal Zoref
  • 25
  • 3