52

This problem is killing the stability of my production servers.

To recap, the basic idea is that my node server(s) intermittently slow down, sometimes resulting in Gateway Timeouts. As best as I can tell from my logs, something is blocking the node thread (meaning that the incoming request is not accepted), but I cannot for the life of me figure out what.

The problem ranges in severity. Sometimes what should be <100ms requests take ~10 seconds to complete; sometimes they never even get accepted by the node server at all. In short, it is as though some random task is working and blocking the node thread for a period of time, thus slowing down (or even blocking) incoming requests; the one thing I can say for sure is that the symptom I need to fix is the "Gateway Timeout".

The issue comes and goes without warning. I have not been able to correlate it against CPU usage, RAM usage, uptime, or any other relevant statistic. I've seen the servers handle a large load fine, and then have this error with a small load, so it does not even appear to be load-related. It is not unusual to see the error around 1am PST, which is the smallest load time of the day! Restarting the node app does seem to maybe make the problem go away for a while, but that really doesn't tell me much. I do wonder if it might be a bug in node.js... not very comforting, considering it is killing my production servers.

I've been logging my server with nodetime and here's the gist of it:

  • CentOS 5.2 running on the Amazon Cloud (m1.large instance)
  • Greater than 5000 MB free memory at all times
  • Less than 150 MB heap size at all times
  • CPU usage is less than 60% at all times

I've also checked my MongoDB servers, which have <5% CPU usage and no requests are taking > 100ms to complete, so I highly doubt there's a bottleneck.

I've wrapped (almost) all my code using Q-promises (see code sample), and of course have avoided Sync() calls like the plague. I've tried to replicate the issue on my testing server (OSX), but have had little luck. Of course, this may be just because the production servers are being used by so many people in so many unpredictable ways that I simply cannot replicate via stress tests...
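The code sample referenced above isn't included in this copy of the question; as a rough illustration only, wrapping a node-style callback API in Q typically looks something like the sketch below (the `db.findUser` stub is hypothetical, not from the original code):

```
var Q = require('q');

// Stand-in for a callback-style data layer (hypothetical, not from the question).
var db = {
  findUser: function (id, callback) {
    setTimeout(function () { callback(null, { id: id, name: 'demo' }); }, 0);
  }
};

// Q.nfcall wraps a node-style (err, result) callback in a promise.
function findUserAsync(id) {
  return Q.nfcall(db.findUser, id);
}

findUserAsync(42)
  .then(function (user) { console.log('found', user); })
  .fail(function (err) { console.error('failed', err); })
  .done(); // rethrow anything unhandled instead of swallowing it silently
```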

Zane Claes
  • 1
    +1, Nice question. Hope you get it sorted out. – Alix Axel Oct 16 '12 at 02:29
  • 2
    Unfortunately problems like these are incredibly hard to pin down without actually being able to see the code and debug directly since it could be due to so many different things. The best anybody can do to help is make guesses. When real debugging has failed, I just start commenting out code until I don't see the problem anymore and then slowly add back functionality in order to pinpoint the issue. – Bill Oct 17 '12 at 19:57
  • @Bill, I'm usually of the same mindset when it comes to commenting out code. Unfortunately, I can only get the bug to appear on production, so it's not really an option to arbitrarily remove functionality. That said, if you want to send me your email to Zane@Claes.biz I'll give you ReadOnly access to the SVN... assuming you enjoy banging your head against problems like this =P – Zane Claes Oct 17 '12 at 22:12
  • Similar but weird problem: I use Flash ([jpegcam](https://github.com/josher19/node-jpegcam)) to capture and upload a photo from the browser. After uploading, that page posts some info about it, but that particular request is never caught by node. If I reload the page, it works: the first request cannot be caught by node, while the second can. One other note: after uploading from Flash, if I load the page from another browser, it works. – Hitesh Chavda Oct 21 '12 at 15:54
  • I have the same problem with my node+mongoose app on Heroku. Still have no idea. – Rafał Sobota Sep 18 '13 at 15:25

7 Answers

15

Many months after I first asked this question, I found the answer.

In a nutshell, the problem was that I was not piping a big asset when transferring it from one server to another. In other words, I was downloading an image from one server before uploading it to an S3 bucket. Instead of streaming the download into the upload, I downloaded the file into memory and then uploaded it.

I'm not sure why this did not show up as a memory spike, or elsewhere in my statistics.
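For illustration, the streaming version looks roughly like the sketch below, using only Node's core modules; the URLs are placeholders, and a real setup would typically go through an S3 client (e.g. knox or aws-sdk) rather than a raw PUT:

```
var https = require('https');
var url = require('url');

// Hypothetical URLs, for illustration only.
var sourceUrl = 'https://origin.example.com/image.jpg';
var uploadUrl = 'https://my-bucket.s3.amazonaws.com/image.jpg'; // e.g. a pre-signed PUT URL

https.get(sourceUrl, function (download) {
  var opts = url.parse(uploadUrl);
  opts.method = 'PUT';
  opts.headers = { 'Content-Type': download.headers['content-type'] || 'application/octet-stream' };
  if (download.headers['content-length']) {
    // S3 generally wants a Content-Length on PUTs; forward it when the origin provides one.
    opts.headers['Content-Length'] = download.headers['content-length'];
  }

  var upload = https.request(opts, function (res) {
    console.log('upload finished with status', res.statusCode);
  });

  // The key change: pipe the download straight into the upload,
  // instead of buffering the whole file in memory first.
  download.pipe(upload);
});
```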

Zane Claes
12

My guess is Mongoose. If you are storing large payloads in Mongo, Mongoose can be pretty slow due to how it builds the Mongoose objects. See https://github.com/LearnBoost/mongoose/issues/950 for more details on the problem. If this is the problem you wouldn't see it in Mongo itself since the query returns quickly, but object instantiation could take 75x the query time.

Try setting up timers (process.hrtime()) before and after the Mongoose objects are created to see if that might be the problem. If it is, I would switch to using the node Mongo driver directly instead of going through Mongoose.
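A rough way to measure this (the `Item` model and connection string are hypothetical) is to time the same query with and without Mongoose document construction; `.lean()`, mentioned in the comments below, returns plain objects and skips most of that overhead:

```
var mongoose = require('mongoose');
mongoose.connect('mongodb://localhost/test'); // placeholder connection string

// Hypothetical model, just for illustration.
var Item = mongoose.model('Item', new mongoose.Schema({}, { strict: false }));

function timeQuery(label, query, done) {
  var start = process.hrtime();
  query.exec(function (err, docs) {
    var diff = process.hrtime(start);
    console.log(label + ':', (diff[0] * 1e3 + diff[1] / 1e6).toFixed(1) + 'ms', err || docs.length + ' docs');
    done();
  });
}

// Full Mongoose documents vs. plain JS objects for the same query.
timeQuery('mongoose docs', Item.find({}), function () {
  timeQuery('lean objects', Item.find({}).lean(), function () {
    process.exit(0);
  });
});
```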

Bill
  • Fascinating, I had no idea! I'm going to test this now and see. – Zane Claes Oct 17 '12 at 20:38
  • Question, though: how would you go about timing the mongoose drivers as opposed to time spent on other operations? Eg, if I put a process.hrtime() right before the mongoose call, and in the first line of the callback, is it not possible that the node thread got called away by some other task and thus the times are not accurate? – Zane Claes Oct 17 '12 at 20:43
  • You would need to go into the Mongoose library and time object construction after a query occurs. – Bill Oct 17 '12 at 20:54
  • Alternatively you can save off one of your .json files and just do a small test like https://gist.github.com/2877516 to see how long it is taking. – Bill Oct 17 '12 at 21:00
  • Thanks for the tip. I fell back on the internal mongoDB driver for the 2 collections which make large database queries (big objects). If this ends up resolving the issue I'll let you know. – Zane Claes Oct 17 '12 at 22:03
  • 3
    @ZaneClaes If you're on Mongoose 3.x you can call [`lean()`](http://mongoosejs.com/docs/api.html#query_Query-lean) on your queries to eliminate the bulk of the typical overhead if you just need plain JS objects and not model instances. – JohnnyHK Oct 18 '12 at 22:52
  • I'm not yet 100% sure, but I think I may have finally discovered the problem, and it is very close to what @Bill guessed. Basically, MongoDB has a 16MB cap on the size of objects inserted into the database. There were rare edge cases where this cap was being exceeded, and it caused an internal warning within Mongo (note: not in Mongoose). I implemented data-caps within my node.js code before inserting objects, and so far it's looking promising... – Zane Claes Oct 30 '12 at 01:06
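A rough sketch of the kind of size guard described in the last comment above, assuming the payload is a plain JSON-serializable object; the JSON length only approximates BSON size, and `safeInsert` and its arguments are hypothetical names:

```
var MAX_DOC_BYTES = 16 * 1024 * 1024; // MongoDB's per-document BSON limit

// Hypothetical helper: refuse to insert documents that are (approximately) too big.
function safeInsert(collection, doc, callback) {
  // JSON length is only an approximation of BSON size, but it is close enough
  // to catch the rare oversized payloads before Mongo rejects them.
  var approxBytes = Buffer.byteLength(JSON.stringify(doc), 'utf8');
  if (approxBytes >= MAX_DOC_BYTES) {
    return callback(new Error('Document too large: ' + approxBytes + ' bytes'));
  }
  collection.insert(doc, { w: 1 }, callback);
}
```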
4

You are heavily leaking memory; try setting every object to null as soon as you don't need it anymore! Read this.

More information about hunting down memory leaks can be found here.

Pay special attention to places where you hold multiple references to the same object, and check whether you have circular references; those are a pain to debug, but tracking them down will help you a great deal.

Try invoking the garbage collector manually every minute or so (I don't know if you can do this in node.js because I'm more of a C++ and PHP coder). From my years of experience working with C++ I can tell you the most likely cause of your application slowing down over time is memory leaks: find them and plug them, and you'll be OK!

Also, assuming you're not caching and/or processing images, audio or video in memory or anything like that, a 150M heap is a lot! That could be hundreds of thousands or even millions of small objects.

You don't have to be running out of memory for your application to slow down... just searching for free memory with that many objects already allocated is a huge job for the memory allocator; it takes a lot of time to allocate each new object, and as you leak more and more memory that time only increases.
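For what it's worth, node does let you trigger collection manually if the process is started with V8's `--expose-gc` flag; a minimal sketch of the periodic manual collection suggested above (whether it actually helps is a separate question):

```
// Start with:  node --expose-gc app.js
// global.gc is only defined when that flag is passed.
if (typeof global.gc === 'function') {
  setInterval(function () {
    var before = process.memoryUsage().heapUsed;
    global.gc();
    var after = process.memoryUsage().heapUsed;
    console.log('manual gc freed', Math.round((before - after) / 1024), 'KB');
  }, 60 * 1000); // roughly once a minute, as suggested above
}
```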

xception
  • Not a bad guess, but I disagree on some points. First, the 150Mb is NOT a big heap size for a node.js app serving thousands of users, as mine does. Furthermore, Node supports heaps up to 1.4 Gb. Other node coders have insisted that circular references are not an issue in node gc and should not impact performance. Finally, manually invoking the GC MIGHT help SOME, but it's generally bad practice... still, I'll give it a shot. – Zane Claes Oct 15 '12 at 23:58
  • I only stated 150M is a lot in the context of small objects, if you have large or very large objects it's nothing... but based on your saying that the garbage collector takes 4 minutes to run after your application runs for 20 minutes to 6 hours, I'm guessing your case is the one where most objects are just up to hundreds of bytes, that generates a lot of memory allocations which the allocator has to re-check with each new alloc to see which blocks are still free. – xception Oct 16 '12 at 00:04
  • As a simple example (I know it's absurd, just trying to make a point): if all your objects were 150 bytes, ignoring the overhead from the memory allocator, that would mean a million objects, and checking that many elements for a free memory block will inevitably take a significant amount of time. Most of what I said should still stand even if node.js has stellar memory management; setting your variables to null as soon as you don't need them helps the garbage collector know earlier that they are no longer used, possibly with less detection logic. – xception Oct 16 '12 at 00:14
  • Again, I'm not a js coder, so some of my points might not be the best advice... but I have seen what a huge impact allocating millions of very small objects has on c++ and if that's your case also you should really do your best to keep that memory used for as little as possible. – xception Oct 16 '12 at 00:17
  • 1
    My objects are not small. In fact, many of them are in the several-Kb, even into the Mb, range. I am doing a lot of retrieval, parsing, and caching (into MongoDB) of JSON objects retrieved from remote servers, and this is the vast majority of my memory usage... so I still disagree with your points. I only have 500-2,000 objects comprising the 150 Mb (which is the MAX heap size) and I can't really minimize them any further. – Zane Claes Oct 16 '12 at 02:31
  • In that case you are right, that's not too much, sorry for wasting your time, I still think it's weird your garbage collector takes 4 minutes to run. Added this question to my favorites, really curious what the answer turns out to be. – xception Oct 16 '12 at 08:47
  • Just remembered what Apache does with Python apps in my webserver configuration: it's configured to just proxy the request to one of a few instances of Python, and each instance will only serve up to 1000 requests; after that it dies and another instance is started in its place. Maybe you could do the same here. That might avoid the problem, even if you would use just a single instance (you'd still need the proxying so that the application is not unavailable while you reload it). – xception Oct 16 '12 at 09:10
  • note that it is 4 SECONDS that the GC runs, not 4 MINUTES. Still, this is far too much, because node is single threaded and a 4 second lock on the thread means that it is blocking incoming connections. Interesting idea on the proxying. You know, I was just thinking last night how it was weird that this problem only came about lately. Then, I realized that my server used to get restarted by forever due to errors in the code... which would have hidden this problem from view ;) – Zane Claes Oct 16 '12 at 16:54
  • Let me know how it goes when you implement that, I am in the process of developing an application which needs to be able to handle very many requests as well... so I might do something similar as well. – xception Oct 16 '12 at 18:23
  • Thanks for the encouragement. I just setup cluster to automatically respawn workers every 1hr. We'll see how it goes; I'll post if it works. – Zane Claes Oct 16 '12 at 20:21
  • Whelp, the cluster idea did not work. Even though it restarts itself every 30min, the issue still persists. – Zane Claes Oct 17 '12 at 15:28
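For reference, the periodic worker recycling tried in the comments above can be sketched with Node's built-in cluster module roughly like this (the interval and port are arbitrary):

```
var cluster = require('cluster');
var http = require('http');

if (cluster.isMaster) {
  var worker = cluster.fork();

  // Replace the worker on an interval (30-60 minutes was tried in the comments above).
  setInterval(function () {
    var old = worker;
    worker = cluster.fork(); // start the replacement first
    old.disconnect();        // then let the old worker drain and exit
  }, 60 * 60 * 1000);

  cluster.on('exit', function (dead) {
    console.log('worker', dead.process.pid, 'exited');
  });
} else {
  http.createServer(function (req, res) {
    res.end('ok\n');
  }).listen(8000);
}
```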
1

Is "--nouse-idle-connection" a mistake? do you really mean "--nouse_idle_notification".

I think it may be a GC issue caused by too many tiny objects. Node is a single process, so watching the busiest CPU core is more important than watching the overall load. When your program is slow, you can run "gdb node <pid>" and "bt" to see what node is busy doing.

darcy
  • Oops, yep, that's what I meant! I've looked back and made sure that was the flag I used (+1 for the catch). Even if the problem were the GC cleaning up small objects... how would I fix it? I've already tried many ways of fixing GC. – Zane Claes Oct 18 '12 at 15:27
  • I've not met any GC problems; it's just an assumption. But I think finding the problem is more important than trying various methods. – darcy Oct 24 '12 at 02:46
1

What I'd do is set up a parallel node instance on the same server with some kind of echo service and test that one. If it runs fine, you narrow the problem down to your program code (and not a scheduler/OS-level problem). Then, step by step, include the modules and test again. Certainly this is a lot of work, it takes a long time, and I don't know if it is doable on your system.
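A bare-bones echo service for the parallel instance could be as small as the sketch below (the port is arbitrary); if it stays responsive while the main app stalls, the problem is in the application code rather than the machine:

```
var http = require('http');

// Minimal echo service: no business logic, no database, nothing to block on.
http.createServer(function (req, res) {
  res.writeHead(200, { 'Content-Type': req.headers['content-type'] || 'text/plain' });
  req.pipe(res); // echo the request body back (empty for plain GETs)
}).listen(3001, function () {
  console.log('echo service listening on 3001');
});
```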

Chris
  • +1 for the idea of setting up a parallel node instance. Unfortunately, I can't step-by-step re-include modules... I can only get the problem to appear on my production servers, and I can't just arbitrarily take features offline there :( – Zane Claes Oct 19 '12 at 14:58
1

If you need to get this working now, you can go the NASA redundancy route:

Bring up a second copy of your production servers, and put a proxy in front of them which routes each request to both stacks and returns the first response. I don't recommend this as a perfect long-term solution but it should help significantly reduce issues in production now, and help you gather log data that you could replay to recreate the issues on non-production servers.

Obviously, this is straightforward for read requests, but more complex for commands which write to the db.
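One possible shape for such a proxy, sketched with only the core http module (backend addresses are placeholders; a production version would more likely build on something like node-http-proxy, and would need to buffer request bodies so they can be sent to both stacks):

```
var http = require('http');

// Placeholder backends: the two identical production stacks.
var backends = [
  { host: '10.0.0.1', port: 8000 },
  { host: '10.0.0.2', port: 8000 }
];

http.createServer(function (req, res) {
  var answered = false;

  backends.forEach(function (backend) {
    var proxied = http.request({
      host: backend.host,
      port: backend.port,
      method: req.method,
      path: req.url,
      headers: req.headers
    }, function (upstream) {
      if (answered) { upstream.resume(); return; } // a faster backend already won
      answered = true;
      res.writeHead(upstream.statusCode, upstream.headers);
      upstream.pipe(res);
    });

    proxied.on('error', function () { /* ignore: the other backend may still answer */ });
    proxied.end(); // GET/HEAD only in this sketch; bodies would need buffering to send twice
  });
}).listen(8080);
```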

Eric
0

We had a similar problem with our Node.js server. It didn't scale well for weeks, and we had tried almost everything, just as you have. Our problem was the implicit backlog value, which is set very low for highly concurrent environments.

http://nodejs.org/api/http.html#http_server_listen_port_hostname_backlog_callback

Setting the backlog to a significantly higher value (e.g. 10000), as well as tuning networking in our kernel (/etc/sysctl.conf on Linux) as described in the manual, helped a lot. From that point forward we haven't had any timeouts in our Node.js server.
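In code, the backlog is just the third argument to `server.listen`; a minimal sketch (the port and value are examples, and the effective backlog is also capped by the kernel, e.g. net.core.somaxconn):

```
var http = require('http');

var server = http.createServer(function (req, res) {
  res.end('ok\n');
});

// listen(port, hostname, backlog, callback): ask the OS to queue up to
// 10000 pending connections instead of the small default.
server.listen(8000, '0.0.0.0', 10000, function () {
  console.log('listening with a larger backlog');
});
```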

krux