52

This problem is killing the stability of my production servers.

To recap, the basic idea is that my node server(s) intermittently slow down, sometimes resulting in Gateway Timeouts. As best as I can tell from my logs, something is blocking the node thread (meaning that the incoming request is not accepted), but I cannot for the life of me figure out what.

The problem ranges in severity. Sometimes what should be <100ms requests take ~10 seconds to complete; sometimes they never even get accepted by the node server at all. In short, it is as though some random task is working and blocking the node thread for a period of time, thus slowing down (or even blocking) incoming requests; the one thing I can say for sure is that the symptom I need to fix is the "Gateway Timeout".

The issue comes and goes without warning. I have not been able to correlate it against CPU usage, RAM usage, uptime, or any other relevant statistic. I've seen the servers handle a large load fine, and then have this error with a small load, so it does not even appear to be load-related. It is not unusual to see the error around 1am PST, which is the smallest load time of the day! Restarting the node app does seem to maybe make the problem go away for a while, but that really doesn't tell me much. I do wonder if it might be a bug in node.js... not very comforting, considering it is killing my production servers.

I've been logging my server with nodetime and here's the gist of it:

  • CentOS 5.2 running on the Amazon Cloud (m1.large instance)
  • Greater than 5000 MB free memory at all times
  • Less than 150 MB heap size at all times
  • CPU usage is less than 60% at all times

I've also checked my MongoDB servers, which have <5% CPU usage and no requests are taking > 100ms to complete, so I highly doubt there's a bottleneck.

I've wrapped (almost) all my code using Q-promises (see code sample), and of course have avoided Sync() calls like the plague. I've tried to replicate the issue on my testing server (OSX), but have had little luck. Of course, this may be just because the production servers are being used by so many people in so many unpredictable ways that I simply cannot replicate via stress tests...
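The code sample referenced above isn't included in this copy of the question; as a rough illustration only, wrapping a node-style callback API in Q typically looks something like the sketch below (the `db.findUser` stub is hypothetical, not from the original code):

```
var Q = require('q');

// Stand-in for a callback-style data layer (hypothetical, not from the question).
var db = {
  findUser: function (id, callback) {
    setTimeout(function () { callback(null, { id: id, name: 'demo' }); }, 0);
  }
};

// Q.nfcall wraps a node-style (err, result) callback in a promise.
function findUserAsync(id) {
  return Q.nfcall(db.findUser, id);
}

findUserAsync(42)
  .then(function (user) { console.log('found', user); })
  .fail(function (err) { console.error('failed', err); })
  .done(); // rethrow anything unhandled instead of swallowing it silently
```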

Zane Claes
  • 1
    +1, Nice question. Hope you get it sorted out. – Alix Axel Oct 16 '12 at 02:29
  • 2
    Unfortunately problems like these are incredibly hard to pin down without actually being able to see the code and debug directly since it could be due to so many different things. The best anybody can do to help is make guesses. When real debugging has failed, I just start commenting out code until I don't see the problem anymore and then slowly add back functionality in order to pinpoint the issue. – Bill Oct 17 '12 at 19:57
  • @Bill, I'm usually of the same mindset when it comes to commenting out code. Unfortunately, I can only get the bug to appear on production, so it's not really an option to arbitrarily remove functionality. That said, if you want to send me your email to Zane@Claes.biz I'll give you ReadOnly access to the SVN... assuming you enjoy banging your head against problems like this =P – Zane Claes Oct 17 '12 at 22:12
  • Similar but weird problem: I use Flash ([jpegcam](https://github.com/josher19/node-jpegcam)) to capture and upload a photo from the browser. After uploading, that page posts some info about it, but that particular request is never caught by node. If I reload the page, it works: the first request cannot be caught by node, while the second can. One other note: after uploading from Flash, if I load the page from another browser, it works. – Hitesh Chavda Oct 21 '12 at 15:54
  • I have the same problem with my node+mongoose app on Heroku. Still have no idea. – Rafał Sobota Sep 18 '13 at 15:25

7 Answers

15

Many months after I first asked this question, I found the answer.

In a nutshell, the problem was that I was not piping a big asset when transferring it from one server to another. In other words, I was downloading an image from one server before uploading it to an S3 bucket. Instead of streaming the download into the upload, I downloaded the file into memory and then uploaded it.

I'm not sure why this did not show up as a memory spike, or elsewhere in my statistics.
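For illustration, the streaming version looks roughly like the sketch below, using only Node's core modules; the URLs are placeholders, and a real setup would typically go through an S3 client (e.g. knox or aws-sdk) rather than a raw PUT:

```
var https = require('https');
var url = require('url');

// Hypothetical URLs, for illustration only.
var sourceUrl = 'https://origin.example.com/image.jpg';
var uploadUrl = 'https://my-bucket.s3.amazonaws.com/image.jpg'; // e.g. a pre-signed PUT URL

https.get(sourceUrl, function (download) {
  var opts = url.parse(uploadUrl);
  opts.method = 'PUT';
  opts.headers = { 'Content-Type': download.headers['content-type'] || 'application/octet-stream' };
  if (download.headers['content-length']) {
    // S3 generally wants a Content-Length on PUTs; forward it when the origin provides one.
    opts.headers['Content-Length'] = download.headers['content-length'];
  }

  var upload = https.request(opts, function (res) {
    console.log('upload finished with status', res.statusCode);
  });

  // The key change: pipe the download straight into the upload,
  // instead of buffering the whole file in memory first.
  download.pipe(upload);
});
```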

Zane Claes
12

My guess is Mongoose. If you are storing large payloads in Mongo, Mongoose can be pretty slow due to how it builds the Mongoose objects. See https://github.com/LearnBoost/mongoose/issues/950 for more details on the problem. If this is the problem you wouldn't see it in Mongo itself since the query returns quickly, but object instantiation could take 75x the query time.

Try setting up timers (process.hrtime()) before and after the Mongoose objects are created to see if that might be the problem. If it is, I would switch to using the node Mongo driver directly instead of going through Mongoose.
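A rough way to measure this (the `Item` model and connection string are hypothetical) is to time the same query with and without Mongoose document construction; `.lean()`, mentioned in the comments below, returns plain objects and skips most of that overhead:

```
var mongoose = require('mongoose');
mongoose.connect('mongodb://localhost/test'); // placeholder connection string

// Hypothetical model, just for illustration.
var Item = mongoose.model('Item', new mongoose.Schema({}, { strict: false }));

function timeQuery(label, query, done) {
  var start = process.hrtime();
  query.exec(function (err, docs) {
    var diff = process.hrtime(start);
    console.log(label + ':', (diff[0] * 1e3 + diff[1] / 1e6).toFixed(1) + 'ms', err || docs.length + ' docs');
    done();
  });
}

// Full Mongoose documents vs. plain JS objects for the same query.
timeQuery('mongoose docs', Item.find({}), function () {
  timeQuery('lean objects', Item.find({}).lean(), function () {
    process.exit(0);
  });
});
```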

Bill
  • Fascinating, I had no idea! I'm going to test this now and see. – Zane Claes Oct 17 '12 at 20:38
  • Question, though: how would you go about timing the mongoose drivers as opposed to time spent on other operations? Eg, if I put a process.hrtime() right before the mongoose call, and in the first line of the callback, is it not possible that the node thread got called away by some other task and thus the times are not accurate? – Zane Claes Oct 17 '12 at 20:43
  • You would need to go into the Mongoose library and time object construction after a query occurs. – Bill Oct 17 '12 at 20:54
  • Alternatively you can save off one of your .json files and just do a small test like https://gist.github.com/2877516 to see how long it is taking. – Bill Oct 17 '12 at 21:00
  • Thanks for the tip. I fell back on the internal mongoDB driver for the 2 collections which make large database queries (big objects). If this ends up resolving the issue I'll let you know. – Zane Claes Oct 17 '12 at 22:03
  • 3
    @ZaneClaes If you're on Mongoose 3.x you can call [`lean()`](http://mongoosejs.com/docs/api.html#query_Query-lean) on your queries to eliminate the bulk of the typical overhead if you just need plain JS objects and not model instances. – JohnnyHK Oct 18 '12 at 22:52
  • I'm not yet 100% sure, but I think I may have finally discovered the problem, and it is very close to what @Bill guessed. Basically, MongoDB has a 16MB cap on the size of objects inserted into the database. There were rare edge cases where this cap was being exceeded, and it caused an internal warning within Mongo (note: not in Mongoose). I implemented data-caps within my node.js code before inserting objects, and so far it's looking promising... – Zane Claes Oct 30 '12 at 01:06
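A rough sketch of the kind of size guard described in the last comment above, assuming the payload is a plain JSON-serializable object; the JSON length only approximates BSON size, and `safeInsert` and its arguments are hypothetical names:

```
var MAX_DOC_BYTES = 16 * 1024 * 1024; // MongoDB's per-document BSON limit

// Hypothetical helper: refuse to insert documents that are (approximately) too big.
function safeInsert(collection, doc, callback) {
  // JSON length is only an approximation of BSON size, but it is close enough
  // to catch the rare oversized payloads before Mongo rejects them.
  var approxBytes = Buffer.byteLength(JSON.stringify(doc), 'utf8');
  if (approxBytes >= MAX_DOC_BYTES) {
    return callback(new Error('Document too large: ' + approxBytes + ' bytes'));
  }
  collection.insert(doc, { w: 1 }, callback);
}
```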
4

You are heavily leaking memory; try setting every object to null as soon as you don't need it anymore! Read this.

More information about hunting down memory leaks can be found here.

Pay special attention to places where you hold multiple references to the same object, and check whether you have circular references; those are a pain to debug, but tracking them down will help you a great deal.

Try invoking the garbage collector manually every minute or so (I don't know if you can do this in node.js because I'm more of a C++ and PHP coder). From my years of experience working with C++ I can tell you the most likely cause of your application slowing down over time is memory leaks: find them and plug them, and you'll be OK!

Also, assuming you're not caching and/or processing images, audio or video in memory or anything like that, a 150M heap is a lot! That could be hundreds of thousands or even millions of small objects.

You don't have to be running out of memory for your application to slow down... just searching for free memory with that many objects already allocated is a huge job for the memory allocator; it takes a lot of time to allocate each new object, and as you leak more and more memory that time only increases.
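For what it's worth, node does let you trigger collection manually if the process is started with V8's `--expose-gc` flag; a minimal sketch of the periodic manual collection suggested above (whether it actually helps is a separate question):

```
// Start with:  node --expose-gc app.js
// global.gc is only defined when that flag is passed.
if (typeof global.gc === 'function') {
  setInterval(function () {
    var before = process.memoryUsage().heapUsed;
    global.gc();
    var after = process.memoryUsage().heapUsed;
    console.log('manual gc freed', Math.round((before - after) / 1024), 'KB');
  }, 60 * 1000); // roughly once a minute, as suggested above
}
```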

xception
  • Not a bad guess, but I disagree on some points. First, the 150Mb is NOT a big heap size for a node.js app serving thousands of users, as mine does. Furthermore, Node supports heaps up to 1.4 Gb. Other node coders have insisted that circular references are not an issue in node gc and should not impact performance. Finally, manually invoking the GC MIGHT help SOME, but it's generally bad practice... still, I'll give it a shot. – Zane Claes Oct 15 '12 at 23:58
  • I only stated 150M is a lot in the context of small objects, if you have large or very large objects it's nothing... but based on your saying that the garbage collector takes 4 minutes to run after your application runs for 20 minutes to 6 hours, I'm guessing your case is the one where most objects are just up to hundreds of bytes, that generates a lot of memory allocations which the allocator has to re-check with each new alloc to see which blocks are still free. – xception Oct 16 '12 at 00:04
  • As a simple example (I know it's absurd, just trying to make a point): if all your objects were 150 bytes, ignoring the overhead from the memory allocator, that would mean a million objects, and checking that many elements for a free memory block will inevitably take a significant amount of time. Most of what I said should still stand even if node.js has stellar memory management; setting your variables to null as soon as you don't need them helps the garbage collector know earlier that they are no longer used, possibly with less detection logic. – xception Oct 16 '12 at 00:14
  • Again, I'm not a js coder, so some of my points might not be the best advice... but I have seen what a huge impact allocating millions of very small objects has on c++ and if that's your case also you should really do your best to keep that memory used for as little as possible. – xception Oct 16 '12 at 00:17
  • 1
    My objects are not small. In fact, many of them are in the several-Kb, even into the Mb, range. I am doing a lot of retrieval, parsing, and caching (into MongoDB) of JSON objects retrieved from remote servers, and this is the vast majority of my memory usage... so I still disagree with your points. I only have 500-2,000 objects comprising the 150 Mb (which is the MAX heap size) and I can't really minimize them any further. – Zane Claes Oct 16 '12 at 02:31
  • In that case you are right, that's not too much, sorry for wasting your time, I still think it's weird your garbage collector takes 4 minutes to run. Added this question to my favorites, really curious what the answer turns out to be. – xception Oct 16 '12 at 08:47
  • Just remembered what Apache does with Python apps in my webserver configuration: it's configured to just proxy the request to one of a few instances of Python, and each instance will only serve up to 1000 requests; after that it dies and another instance is started in its place. Maybe you could do the same here. That might avoid the problem, even if you would use just a single instance (you'd still need the proxying so that the application is not unavailable while you reload it). – xception Oct 16 '12 at 09:10
  • note that it is 4 SECONDS that the GC runs, not 4 MINUTES. Still, this is far too much, because node is single threaded and a 4 second lock on the thread means that it is blocking incoming connections. Interesting idea on the proxying. You know, I was just thinking last night how it was weird that this problem only came about lately. Then, I realized that my server used to get restarted by forever due to errors in the code... which would have hidden this problem from view ;) – Zane Claes Oct 16 '12 at 16:54
  • Let me know how it goes when you implement that, I am in the process of developing an application which needs to be able to handle very many requests as well... so I might do something similar as well. – xception Oct 16 '12 at 18:23
  • Thanks for the encouragement. I just setup cluster to automatically respawn workers every 1hr. We'll see how it goes; I'll post if it works. – Zane Claes Oct 16 '12 at 20:21
  • Whelp, the cluster idea did not work. Even though it restarts itself every 30min, the issue still persists. – Zane Claes Oct 17 '12 at 15:28
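For reference, the periodic worker recycling tried in the comments above can be sketched with Node's built-in cluster module roughly like this (the interval and port are arbitrary):

```
var cluster = require('cluster');
var http = require('http');

if (cluster.isMaster) {
  var worker = cluster.fork();

  // Replace the worker on an interval (30-60 minutes was tried in the comments above).
  setInterval(function () {
    var old = worker;
    worker = cluster.fork(); // start the replacement first
    old.disconnect();        // then let the old worker drain and exit
  }, 60 * 60 * 1000);

  cluster.on('exit', function (dead) {
    console.log('worker', dead.process.pid, 'exited');
  });
} else {
  http.createServer(function (req, res) {
    res.end('ok\n');
  }).listen(8000);
}
```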
1

Is "--nouse-idle-connection" a mistake? do you really mean "--nouse_idle_notification".

I think it may be a GC issue caused by too many tiny objects. Node is a single process, so watching the busiest CPU core is more important than watching the overall load. When your program is slow, you can run "gdb node <pid>" and "bt" to see what node is busy doing.

darcy
  • Oops, yep, that's what I meant! I've looked back and made sure that was the flag I used (+1 for the catch). Even if the problem were the GC cleaning up small objects... how would I fix it? I've already tried many ways of fixing GC. – Zane Claes Oct 18 '12 at 15:27
  • I've not met any GC problems; it's just an assumption. But I think finding the problem is more important than trying various methods. – darcy Oct 24 '12 at 02:46
1

What I'd do is set up a parallel node instance on the same server with some kind of echo service and test that one. If it runs fine, you narrow the problem down to your program code (and not a scheduler/OS-level problem). Then, step by step, include the modules and test again. Certainly this is a lot of work, it takes a long time, and I don't know if it is doable on your system.
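A bare-bones echo service for the parallel instance could be as small as the sketch below (the port is arbitrary); if it stays responsive while the main app stalls, the problem is in the application code rather than the machine:

```
var http = require('http');

// Minimal echo service: no business logic, no database, nothing to block on.
http.createServer(function (req, res) {
  res.writeHead(200, { 'Content-Type': req.headers['content-type'] || 'text/plain' });
  req.pipe(res); // echo the request body back (empty for plain GETs)
}).listen(3001, function () {
  console.log('echo service listening on 3001');
});
```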

Chris
  • +1 for the idea of setting up a parallel node instance. Unfortunately, I can't step-by-step re-include modules... I can only get the problem to appear on my production servers, and I can't just arbitrarily take features offline there :( – Zane Claes Oct 19 '12 at 14:58
1

If you need to get this working now, you can go the NASA redundancy route:

Bring up a second copy of your production servers, and put a proxy in front of them which routes each request to both stacks and returns the first response. I don't recommend this as a perfect long-term solution but it should help significantly reduce issues in production now, and help you gather log data that you could replay to recreate the issues on non-production servers.

Obviously, this is straightforward for read requests, but more complex for commands which write to the db.
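One possible shape for such a proxy, sketched with only the core http module (backend addresses are placeholders; a production version would more likely build on something like node-http-proxy, and would need to buffer request bodies so they can be sent to both stacks):

```
var http = require('http');

// Placeholder backends: the two identical production stacks.
var backends = [
  { host: '10.0.0.1', port: 8000 },
  { host: '10.0.0.2', port: 8000 }
];

http.createServer(function (req, res) {
  var answered = false;

  backends.forEach(function (backend) {
    var proxied = http.request({
      host: backend.host,
      port: backend.port,
      method: req.method,
      path: req.url,
      headers: req.headers
    }, function (upstream) {
      if (answered) { upstream.resume(); return; } // a faster backend already won
      answered = true;
      res.writeHead(upstream.statusCode, upstream.headers);
      upstream.pipe(res);
    });

    proxied.on('error', function () { /* ignore: the other backend may still answer */ });
    proxied.end(); // GET/HEAD only in this sketch; bodies would need buffering to send twice
  });
}).listen(8080);
```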

Eric
0

We had a similar problem with our Node.js server. It didn't scale well for weeks, and we had tried almost everything, just as you have. Our problem was the implicit backlog value, which is set very low for highly concurrent environments.

http://nodejs.org/api/http.html#http_server_listen_port_hostname_backlog_callback

Setting the backlog to a significantly higher value (e.g. 10000), as well as tuning networking in our kernel (/etc/sysctl.conf on Linux) as described in the manual, helped a lot. From that point forward we haven't had any timeouts in our Node.js server.
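In code, the backlog is just the third argument to `server.listen`; a minimal sketch (the port and value are examples, and the effective backlog is also capped by the kernel, e.g. net.core.somaxconn):

```
var http = require('http');

var server = http.createServer(function (req, res) {
  res.end('ok\n');
});

// listen(port, hostname, backlog, callback): ask the OS to queue up to
// 10000 pending connections instead of the small default.
server.listen(8000, '0.0.0.0', 10000, function () {
  console.log('listening with a larger backlog');
});
```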

krux