
Here is a benchmark of my function:

mark@ichikawa:~/inbox/D3/read_logs$ time python countbytes.py
bytes: 277464

real    0m0.037s
user    0m0.036s
sys     0m0.000s
mark@ichikawa:~/inbox/D3/read_logs$ time node countbytes.js 
bytes: 277464

real    0m0.144s
user    0m0.120s
sys     0m0.032s

The measurements were taken on an Ubuntu 13.04 x86_64 machine.

This is the simple version of my benchmark (I also ran 1000 iterations). It shows that the function I wrote to read tgz files takes more than 3x as long as the equivalent function I wrote in Python.

For 1000 iterations with a 277 kB file (I used process.hrtime and timeit):

Node:   30.608409032000015
Python:  6.84210395813

For 1000 iterations with a 9.7 MB file:

Node:   590.491709309999
Python: 200.796745062

Please let me know if you have any idea on how to speed up reading the tgz files.

Here is the code:

var fs = require('fs');
var tar = require('tar');
var zlib = require('zlib');
var Stream = require('stream');


var countBytes = new Stream;
countBytes.writable = true;
countBytes.count = 0;
countBytes.bytes = 0;

countBytes.write = function (buf) {
    countBytes.bytes += buf.length;
};

countBytes.end = function (buf) {
    if (arguments.length) countBytes.write(buf);

    countBytes.writable = false;
    console.log('bytes: ' + countBytes.bytes);
};

countBytes.destroy = function () {
    countBytes.writable = false;
};


fs.createReadStream('supercars-logs-13060317.tgz')
    .pipe(zlib.createUnzip())
    .pipe(tar.Extract({path: "responsetimes.log.13060317"}))
    .pipe(countBytes);

Any idea how to speed things up?

moin moin
  • How large is `supercars-logs-13060317.tgz`? And have you tried comparing them on different size files? – Paul Jun 17 '13 at 16:29
  • I'm curious to know if the time difference increases or remains and about 25 seconds for a much larger file. That should tell you if it's the extraction itself that is slower or the overhead involved with the extraction. – Paul Jun 17 '13 at 20:11
  • For comparison, what does the Python code look like? – icktoofay Jun 18 '13 at 06:35

1 Answer


It looks good, but I'm curious why you are using the tar stream at all.

I would also implement countBytes as a Transform stream instead; I like to use through2 for this:

var fs = require('fs')
  , zlib = require('zlib')
  , thr = require('through2')
  , cache = {bytes: 0}
  ;
fs.createReadStream('supercars-logs-13060317.tgz')
  .pipe(zlib.createUnzip())
  .pipe(thr(function(chunk, enc, next){
    cache.bytes += chunk.length
    next() // count only; nothing reads the output, so don't re-emit the chunk
  }))
  .on('finish', function(){ // 'finish' fires when the writable side ends
    console.log('bytes: ' + cache.bytes)
  })
markuz-gj
  • I am not sure if you noticed that this question is almost a year old. Regarding tar: I think it was originally invented to write files to tape, and we still use it a lot on Linux. Usually an operation like reading (as in my case) is done per file, NOT on the whole archive. It looks like this is still the recommended way to do it in Node.js: http://stackoverflow.com/questions/21989460/node-js-specify-files-to-unzip-with-zlib-tar. Just out of curiosity, is your solution faster than mine? – moin moin May 18 '14 at 18:43