16

I'm using node v0.12.7 and want to stream directly from a database to the client (for file download). However, I am noticing a large memory footprint (and possible memory leak) when using streams.

With express, I create an endpoint that simply pipes a readable stream to the response as follows:

app.post('/query/stream', function(req, res) {

  res.setHeader('Content-Type', 'application/octet-stream');
  res.setHeader('Content-Disposition', 'attachment; filename="blah.txt"');

  //...retrieve stream from somewhere...
  // stream is a readable stream in object mode

  stream
    .pipe(json_to_csv_transform_stream) // I've removed this and see the same behavior
    .pipe(res);
});

In production, the readable stream retrieves data from a database. The amount of data is quite large (1M+ rows). I swapped out this readable stream with a dummy stream (see code below) to simplify debugging and am noticing the same behavior: my memory usage jumps up by ~200M each time. Sometimes garbage collection kicks in and the memory drops down a bit, but it keeps rising until my server runs out of memory.

The reason I started using streams was to avoid having to load large amounts of data into memory. Is this behavior expected?

I also notice that, while streaming, my CPU usage jumps to 100% and the process blocks (which means other requests can't be processed).

Am I using this incorrectly?

Dummy readable stream code

// Setup a custom readable
var Readable = require('stream').Readable;

function Counter(opt) {
  Readable.call(this, opt);
  this._max = 1000000; // Maximum number of records to generate
  this._index = 1;
}
require('util').inherits(Counter, Readable);

// Override internal read
// Send dummy objects until max is reached
Counter.prototype._read = function() {
  var i = this._index++;
  if (i > this._max) {
    this.push(null);
  }
  else {
    this.push({
      foo: i,
      bar: i * 10,
      hey: 'dfjasiooas' + i,
      dude: 'd9h9adn-09asd-09nas-0da' + i
    });
  }
};

// Create the readable stream
var counter = new Counter({objectMode: true});

//...return it to calling endpoint handler...

Update

Just a small update, I never found the cause. My initial solution was to use cluster to spawn off new processes so that other requests could still be handled.
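
For reference, the cluster workaround looked roughly like this (a simplified sketch, not my exact code; `./app` is a placeholder for the module that calls `app.listen`):

// Master forks one worker per CPU; a long-running stream in one worker
// no longer blocks requests handled by the others.
var cluster = require('cluster');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
} else {
  // Each worker runs its own express server; sharing the port is handled by cluster.
  require('./app');
}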

I've since updated to node v4. While cpu/mem usage is still high during processing, it seems to have fixed the leak (meaning mem usage goes back down).

lebolo
  • Why don't you use this stream to write a temp file and send the path to the user so they can download it, removing the file afterwards? – Scoup Sep 21 '15 at 17:43
  • Your dummy code is actually a synchronous stream, so it'll block other code execution. But that doesn't explain the high cpu/memory usage. What happens if you make it asynchronous, e.g. by calling `this.push` inside `setImmediate`? – hassansin Sep 22 '15 at 00:41
  • @Scoup I looked into this a while back and piping to a file (via `fs.createWriteStream`) gave the same high cpu/memory behavior. – lebolo Sep 22 '15 at 14:52
  • @hassansin I haven't tried on the dummy code, but I tried modifying [the actual readable stream](https://github.com/brianc/node-pg-query-stream/blob/master/index.js#L62) code I'm using (via `process.nextTick`). When I did it within the for loop, it didn't stream at all. Outside of the for loop, I saw no change in behavior. – lebolo Sep 22 '15 at 15:56
  • Also tried with `setImmediate` – lebolo Sep 22 '15 at 16:02
  • Relevant question http://stackoverflow.com/questions/25237013/node-js-unbounded-concurrency-stream-backpressure-over-tcp – Vanuan Oct 03 '16 at 16:47

5 Answers

13

Update 2: Here's a history of various Stream APIs:

https://medium.com/the-node-js-collection/a-brief-history-of-node-streams-pt-2-bcb6b1fd7468

0.12 uses Streams 3.

Update: This answer was true for the old node.js streams. The new Stream API has a mechanism to pause the readable stream if the writable stream can't keep up.

Backpressure

It looks like you've been hit by the classic node.js "backpressure" problem. This article explains it in detail.

But here's a TL;DR:

You're right: streams are used so that you don't have to load large amounts of data into memory.

But unfortunately streams don't have a mechanism to know whether it's ok to continue streaming. Streams are dumb: they just throw data into the next stream as fast as they can.

In your example you're reading a large csv file and streaming it to the client. The thing is that the speed of reading the file is greater than the speed of uploading it over the network. So the data needs to be stored somewhere until it can be safely forgotten. That's why your memory keeps growing until the client has finished downloading.

The solution is to throttle the reading stream to the speed of the slowest stream in the pipe. I.e. you prepend your reading stream with another stream which tells your reading stream when it is ok to read the next chunk of data.
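
A minimal sketch of that idea using the `throttle` module mentioned in the comment below (the 1 MB/s figure is an arbitrary example; `throttle` works on byte streams, so it has to come after the object-to-text transform):

var Throttle = require('throttle');

// Cap throughput at ~1MB/s so the slow network write can keep up
// and the readable side isn't drained faster than the client downloads.
stream
  .pipe(json_to_csv_transform_stream) // emits plain text (bytes)
  .pipe(new Throttle(1024 * 1024))    // bytes per second
  .pipe(res);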

Vanuan
  • This was the answer. Adding this module https://github.com/TooTallNate/node-throttle reduced peak memory for a large file-to-file stream from 6GB to 100MB – prototype Aug 05 '21 at 00:26
6

It appears you are doing everything correctly. I copied your test case and am experiencing the same issue in v4.0.0. Taking it out of objectMode and using JSON.stringify on your object appeared to prevent both high memory and high cpu. That led me to the built-in JSON.stringify, which appears to be the root of the problem. Using the streaming library JSONStream instead of the v8 method fixed this for me. It can be used like this: `.pipe(JSONStream.stringify())`.
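
For example, plugged into the endpoint from the question (a sketch; `stream` is the object-mode readable):

var JSONStream = require('JSONStream');

app.post('/query/stream', function(req, res) {

  res.setHeader('Content-Type', 'application/octet-stream');
  res.setHeader('Content-Disposition', 'attachment; filename="blah.txt"');

  //...retrieve the object-mode readable stream from somewhere...

  stream
    .pipe(JSONStream.stringify()) // serializes each object incrementally instead of buffering everything
    .pipe(res);
});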

Cody Gustafson
  • Hmm, the actual readable stream I use is [node-pg-query-stream](https://github.com/brianc/node-pg-query-stream), which provides rows from the DB as JSON objects (from a custom `Cursor` object). I think it would take a lot of work to get it to stream rows without `objectMode`. – lebolo Sep 22 '15 at 16:08
  • It looks like their Readme actually has an example with JSONstream. That is almost certainly the issue. Did you try the `.pipe` that was in my edit? You may need a similar solution for the csv output it looks like you are trying for. – Cody Gustafson Sep 22 '15 at 16:18
  • Yeah, I tried using `stream.pipe(JSONStream.stringify()).pipe(res)` (where objectMode is still on since `stringify` expects an object) and am still seeing the same high usage. – lebolo Sep 22 '15 at 16:37
  • Hmm, I thought that would do it. Is there anything else happening with `stream`? With just that stringify I wouldn't expect any issues. – Cody Gustafson Sep 22 '15 at 16:46
  • Any update @lebolo? Did you end up finding out what caused the high memory usage? – Cody Gustafson Sep 27 '15 at 18:32
  • I tried this trick and it works. The heavy GC activity went away after some time. Before that (using JSON.stringify), garbage collection stayed very high forever. – calebeaires Jan 05 '22 at 13:25
2

Before anything else, try this:

  1. Add manual/explicit garbage collection calls to your app,
  2. Install heapdump: `npm install heapdump`
  3. Add code to force garbage collection and dump the rest of the heap to find a leak:

    var heapdump = require('heapdump');
    
    app.post('/query/stream', function (req, res) {
    
        res.setHeader('Content-Type', 'application/octet-stream');
        res.setHeader('Content-Disposition', 'attachment; filename="blah.txt"');
    
        //...retrieve stream from somewhere...
        // stream is a readable stream in object mode
    
        global.gc();
        heapdump.writeSnapshot('./ss-' + Date.now() + '-begin.heapsnapshot');
    
        stream.on('end', function () {
            global.gc();
            console.log("DONNNNEEEE");
            heapdump.writeSnapshot('./ss-' + Date.now() + '-end.heapsnapshot');
        });
    
        stream
                .pipe(json_to_csv_transform_stream) // I've removed this and see the same behavior
                .pipe(res);
    });
    
  4. Run your application with node's `--expose_gc` flag: `node --expose_gc app.js`

  5. Investigate the dumps with Chrome DevTools

After I forced garbage collection on the application I assembled, memory usage was back to normal (approx. 67MB). Which means either:

  1. Maybe GC had not been running in such a short period and there is no leak at all (a major garbage collection cycle can idle for quite a while before starting). Here is a good article on V8 GC; it says nothing about exact GC timings, only compares GC cycles to each other, but it's clear that the less time spent on major GC, the better.

  2. Or I did not reproduce your issue well. In that case, please take a look here and help me reproduce the issue better.

Alexander Arutinyants
0

It's too easy to have a memory leak in Node.js

Usually it's a minor thing, like declaring a variable after creating an anonymous function, or using a function argument inside a callback. But it makes a huge difference to the closure context, and as a result some variables can never be freed.

This article explains the different types of memory leaks you may have and how to find them. Number 4 - Closures - is the most common one.

I've found a rule that would allow you to avoid leaks:

  1. Always declare all your variables before assigning them.
  2. Declare functions after you have declared all variables.
  3. Avoid closures anywhere near loops or big chunks of data (see the sketch below).
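
A contrived illustration of the closure case (hypothetical code, not taken from the question):

function leaky() {
  // ~100MB that we only need briefly (node v0.12-era Buffer API)
  var bigBuffer = new Buffer(100 * 1024 * 1024);

  setInterval(function () {
    // Only the length is used, but the closure keeps a reference to the
    // whole buffer, so it can never be garbage collected while the timer
    // is alive.
    console.log('still holding ' + bigBuffer.length + ' bytes');
  }, 1000);
}

leaky();
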
Vanuan
-3

To me it looks like you are load testing multiple stream modules. That is a nice service to provide for the Node community, but you may also consider just caching the postgres data dump to a file, gzipping it, and serving the static file.

Or maybe make your own Readable that uses a cursor and outputs CSV (as plain strings/text), as sketched below.
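
A rough sketch of that second option (hypothetical code, mirroring the question's dummy Counter but pushing plain CSV text instead of objects, so no objectMode or stringify step is needed):

var Readable = require('stream').Readable;
var util = require('util');

function CsvCounter(opt) {
  Readable.call(this, opt);
  this._max = 1000000;
  this._index = 1;
}
util.inherits(CsvCounter, Readable);

CsvCounter.prototype._read = function() {
  var i = this._index++;
  if (i > this._max) {
    this.push(null); // done
  } else {
    // Push a plain-text CSV line; the stream stays in normal (buffered) mode.
    this.push(i + ',' + (i * 10) + ',dfjasiooas' + i + ',d9h9adn-09asd-09nas-0da' + i + '\n');
  }
};

// new CsvCounter().pipe(res); // no object-mode transform in the pipeline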

Jason Livesay