3

I've got a Node application which listens to a websocket data feed and acts on it by talking to another API. I'm now running into performance problems. Most of the time things are quiet, with the CPU at about 2-5%, but sometimes (about 3 times per 24 hours) the websocket feed we receive suddenly goes wild for a couple minutes with a lot of data. This makes the application do a lot of calculations, causing the CPU to spike to 100%, and causing all sorts of other trouble. I cannot predict these busy times and I can't really replicate it in a testing setup either. For these reasons I'm having a hard time profiling these spikes.

I'm not a Node guru, but I tried profiling this application using the node --prof flag, followed by the --prof-process flag (on a 3GB isolate-0x321c640-v8.log file). This is ok-ish, but the problem is that if I do that I profile the whole time it ran, instead of the high-traffic part of the time it ran.

I checked out the isolate-0x321c640-v8.log file (see excerpt below) hoping for some sort of timestamp on every line so that I could isolate the time I'm interested in, but I can't find anything like that in there.

tick,0x8ad1f58c24,26726463388,0,0x3fedc8b5859026ea,0,0x8ad76332f8,0x8ad7619f68,0x84113fbe10b,0x8ad12fd54f,0x8ad734f837,0x8ad735192b,0x8ad59c2598,0x8ad59c9765
tick,0x8ad1f6d472,26726464443,0,0x3ff76afe21366278,0,0x8ad7633873,0x8ad7619f68,0x84113fbe10b,0x8ad12fd54f,0x8ad734f837,0x8ad735192b,0x8ad59c2598,0x8ad59c9765
tick,0x8ad1206bd5,26726465499,0,0x8ad1f58c40,0,0x8ad76332f8,0x8ad7619f68,0x84113fbe10b,0x8ad12fd54f,0x8ad734f837,0x8ad735192b,0x8ad59c2598,0x8ad59c9765
tick,0x8ad1f6d472,26726466552,0,0x400040d9bba74cfb,0,0x8ad763377d,0x8ad7619f68,0x84113fbe10b,0x8ad12fd54f,0x8ad734f837,0x8ad735192b,0x8ad59c2598,0x8ad59c9765
tick,0x8ad1f591fa,26726467615,0,0x3fe94cccccccccce,0,0x8ad7626638,0x8ad761c1d9,0x84113fbe10b,0x8ad12fd54f,0x8ad734f837,0x8ad735192b,0x8ad59c2598,0x8ad59c9765
tick,0x8ad1f6d472,26726468680,0,0x7ffcc894f270,0,0x8ad1f59054,0x8ad7626638,0x8ad761c1d9,0x84113fbe10b,0x8ad12fd54f,0x8ad734f837,0x8ad735192b,0x8ad59c2598,0x8ad59c9765
tick,0x8ad1f6d41c,26726469744,0,0x329ab68,0,0x8ad7626cc9,0x8ad761c1d9,0x84113fbe10b,0x8ad12fd54f,0x8ad734f837,0x8ad735192b,0x8ad59c2598,0x8ad59c9765

Is there a good way to profile these specific times during runtime?

halfer
  • 19,824
  • 17
  • 99
  • 186
kramer65
  • 50,427
  • 120
  • 308
  • 488

3 Answers3

5

An easy answer for your problem I can't suggest but I prepared the following roadmap to learn more on the node.js tools and techniques for profiling and optimization that can be helpful for you.

Alexandru Olaru
  • 6,842
  • 6
  • 27
  • 53
0

I would instead of hunting down and hoping to "catch" the time frame the issue is occurring to try to simulate it via stress testing your app.

First setup the tools you need to monitor what is happening. Anything you find readable and helpful should be used here. For example you can use 0x, v8-profiler, heap-profile etc. The key here is to try those and see what is the most readable and easy to follow since some of them have so much data that you get lost in it.

Then get a sample of the data coming in and use libs like Artilery etc to make tons of requests and see what gives. I mean between the flame graphs of 0x and all the data from the other debugging libs you should have enough information to reach a conclusion.

If simulating this is not possible then you can simply run any of the profiling libs on a setInterval and write to the disk the data. For example heap-profile is used with this approach:

const heapProfile = require('heap-profile');

heapProfile.start();

// Write a snapshot to disk every hour
setInterval(() => {
  heapProfile.write((err, filename) => {
    console.log(`heapProfile.write. err: ${err} filename: ${filename}`);
  });
}, 60 * 60 * 1000).unref();

So every hour you would have a data file to look into.

If your node crashes then you could write the dump in the process.catch:

process.on('uncaughtException', function (err) {
  heapProfile.write(...)
});

Hope this helps.

Note: process.on('uncaughtException' ... is still under debate as if it is the correct approach to handle the crashes of node from my understanding vs the use of domains.

Akrion
  • 18,117
  • 1
  • 34
  • 54
0

Have you thought about integrating a 3rd party Application Performance Management (APM) like NewRelic, AppDynamics , Atatus or Keymetrics? They all have their strengths and weaknesses and different pricing plans. But they all share the goal to help developers to better understand their applications to pin down problems like yours.

Some of them have a free trial. That might be a good start to see if it fits your needs or not.

haggis
  • 407
  • 2
  • 20