
I'm using Node to process log files from an application, and due to the traffic volumes these can be a gigabyte or so in size each day.

The files are gzipped every night, and I need to read the files without having to unzip them to disk.

From what I understand I can use zlib to decompress the file into some form of stream, but I don't know how to get at the data, and I'm not sure how I can then easily handle it a line at a time (though I know some kind of loop searching for \n will be involved).

The closest answer I found so far demonstrates how to pipe the stream to a sax parser, but the whole Node pipes/streams thing is a little confusing:

fs.createReadStream('large.xml.gz').pipe(zlib.createUnzip()).pipe(saxStream);
Zac Tolley
  • Have you considered writing a native extension and using a C++ library? If your files are that large, this might be the best option... – MobA11y Jul 02 '13 at 20:22
  • Don't know C++ tbh. Currently I can do it by unzipping the file and then using readline, but when I roll that into the production environment the permissions are locked down, so I can't change the contents of the log folder, only read from it. – Zac Tolley Jul 02 '13 at 22:17
  • Try executing your node process with sudo? – MobA11y Jul 03 '13 at 01:40
  • To parse file line by line you can see here http://stackoverflow.com/a/16013228/568109. You will have to pass decompressed stream though. – user568109 Jul 03 '13 at 04:30
  • Not really a good security practice to run a service as sudo – Zac Tolley Jul 03 '13 at 07:48
  • I do this already (meant readline, not deadline) but the security issue means I'll need to change to read the files directly... otherwise I'll have to decompress to /tmp – Zac Tolley Jul 03 '13 at 07:49

1 Answer


You should take a look at sax. It is developed by isaacs!

I haven't tested this code, but I would start by writing something along these lines.

var Promise = global.Promise || require('es6-promise').Promise
, thr = require('through2')
, createReadStream = require('fs').createReadStream
, createUnzip = require('zlib').createUnzip
, createParser = require('sax').createStream
;

function processXml (filename) {
  return new Promise(function(resolve, reject){
    var unzip = createUnzip()
    , xmlParser = createParser()
    ;

    xmlParser.on('opentag', function(node){
      // do stuff with the node
    })
    xmlParser.on('attribute', function(attr){
      // do more stuff with the attribute
    })

    // instead of rejecting, you may handle the error instead.
    xmlParser.on('error', reject) 
    xmlParser.on('end', resolve)

    createReadStream(filename)
    .pipe(unzip)
    .pipe(xmlParser)
    .pipe(thr(function(chunk, enc, next){
      // as soon as xmlParser is done with a node, it passes it downstream.
      // transform the chunk here if you wish before passing it on
      next(null, chunk)
    }))
  })
}

processXml('large.xml.gz').then(function(){
  console.log('done')
})
.catch(function(err){
  // handle error.
})

I hope that helps.

markuz-gj