
I'm receiving SIGSEGV at random when running an Express app with PM2. The strange thing is that the server has run without problems for the past few weeks. It does not print any error message except:

App [XXX] with id [7] and pid [27757], exited with code [255] via signal [SIGSEGV]      

After adding the "segfault-handler" module, I started to receive some stack traces. The app seems to hit a few different segmentation faults:

/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7fd211f87330]
node(_ZN2v88internal9HashTableINS0_15ObjectHashTableENS0_20ObjectHashTableShapeENS0_6HandleINS0_6ObjectEEEE18FindInsertionEntryEj+0x40)[0xc0b680]
node(_ZN2v88internal15ObjectHashTable3PutENS0_6HandleIS1_EENS2_INS0_6ObjectEEES5_i+0x124)[0xc0c0a4]
node(_ZN2v88internal7Runtime17WeakCollectionSetENS0_6HandleINS0_16JSWeakCollectionEEENS2_INS0_6ObjectEEES6_i+0x59)[0xc7d639]
node(_ZN2v88internal25Runtime_WeakCollectionSetEiPPNS0_6ObjectEPNS0_7IsolateE+0x11d)[0xc7d89d]
[0x2acdd80963b]

/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f0fc311c330]
node(_ZN2v88internal32IncrementalMarkingMarkingVisitor26VisitFixedArrayIncrementalEPNS0_3MapEPNS0_10HeapObjectE+0x376)[0xad8a16]
node(_ZN2v88internal18IncrementalMarking4StepElNS1_16CompletionActionENS1_18ForceMarkingActionENS1_21ForceCompletionActionE+0x2c1)[0xad6181]
node(_ZN2v88internal8NewSpace15SlowAllocateRawEiNS0_19AllocationAlignmentE+0x74)[0xb05244]
node(_ZN2v88internal4Heap11AllocateRawEiNS0_15AllocationSpaceES2_NS0_19AllocationAlignmentE+0x1b9)[0xa678c9]
node(_ZN2v88internal4Heap20AllocateFillerObjectEibNS0_15AllocationSpaceE+0x19)[0xab00b9]
node(_ZN2v88internal7Factory15NewFillerObjectEibNS0_15AllocationSpaceE+0x2d)[0xa67d1d]
node(_ZN2v88internal29Runtime_AllocateInTargetSpaceEiPPNS0_6ObjectEPNS0_7IsolateE+0x5e)[0xc99e8e]
[0x249862c06355]

/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7fbebabd2330]
node(_ZN2v88internal9HashTableINS0_15ObjectHashTableENS0_20ObjectHashTableShapeENS0_6HandleINS0_6ObjectEEEE18FindInsertionEntryEj+0x40)[0xc0b680]
node(_ZN2v88internal15ObjectHashTable3PutENS0_6HandleIS1_EENS2_INS0_6ObjectEEES5_i+0x124)[0xc0c0a4]
node(_ZN2v88internal7Runtime17WeakCollectionSetENS0_6HandleINS0_16JSWeakCollectionEEENS2_INS0_6ObjectEEES6_i+0x59)[0xc7d639]
node(_ZN2v88internal25Runtime_WeakCollectionSetEiPPNS0_6ObjectEPNS0_7IsolateE+0x11d)[0xc7d89d]
[0x125b9620963b]
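
The mangled C++ symbols in these traces can be made readable. `c++filt` from binutils does this properly; the snippet below is only a rough Node.js sketch that pulls out the qualified function name. The helper names (`roughDemangle`, `readIdentifier`, `skipTemplateArgs`) are made up for illustration, and the parser assumes the simple nested-name form (`_ZN...E`) seen above: template arguments are skipped rather than decoded, and parameter types are ignored.

```javascript
// Read one length-prefixed identifier (e.g. "8internal") starting at index i.
// Returns the identifier and the index just past it.
function readIdentifier(s, i) {
  let j = i;
  while (j < s.length && s[j] >= '0' && s[j] <= '9') j++;
  const len = Number(s.slice(i, j));
  return [s.slice(j, j + len), j + len];
}

// Skip a template argument list that starts with 'I' at index i.
// Assumption: substitutions look like "S0_" (special forms such as "St"
// are not handled by this sketch).
function skipTemplateArgs(s, i) {
  let depth = 0;
  while (i < s.length) {
    const c = s[i];
    if (c >= '0' && c <= '9') {
      i = readIdentifier(s, i)[1];
      continue;
    }
    if (c === 'S') {
      const end = s.indexOf('_', i);
      i = end === -1 ? s.length : end + 1;
      continue;
    }
    if (c === 'N' || c === 'I') depth++;
    if (c === 'E') depth--;
    i++;
    if (depth === 0) return i;
  }
  return i;
}

// Extract "ns::Class::method" from a mangled nested name.
function roughDemangle(symbol) {
  if (!symbol.startsWith('_ZN')) return symbol; // only nested names handled
  const parts = [];
  let i = 3;
  while (i < symbol.length && symbol[i] !== 'E') {
    if (symbol[i] >= '0' && symbol[i] <= '9') {
      const [name, next] = readIdentifier(symbol, i);
      parts.push(name);
      i = next;
    } else if (symbol[i] === 'I' && parts.length > 0) {
      i = skipTemplateArgs(symbol, i);
      parts[parts.length - 1] += '<...>';
    } else {
      break; // a construct this sketch does not understand
    }
  }
  return parts.join('::');
}

// Usage: pull the symbol out of one frame line and print the readable name.
const frame = 'node(_ZN2v88internal25Runtime_WeakCollectionSetEiPPNS0_6ObjectEPNS0_7IsolateE+0x11d)[0xc7d89d]';
const match = /\((_Z[^+)]+)/.exec(frame);
console.log(match ? roughDemangle(match[1]) : frame);
// prints "v8::internal::Runtime_WeakCollectionSet"
```

Read this way, the first and third traces end in a `WeakMap`/`WeakSet` insertion and the second ends inside incremental garbage-collection marking.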

I know there is little information here. Can anyone please tell me a good way to start diagnosing this? I've checked the PM2 and MongoDB logs, but no luck.

Thanks! Mars

Mars Zhu
  • I'd start by looking into non-built-in modules you are using that have their own binary/compiled code (non-JavaScript). That seems the more likely place for a segfault to happen, or for the damage to occur that later leads to a segfault. Also, what exact version of node.js are you running? – jfriend00 Feb 21 '17 at 17:23
  • @jfriend00 thank you for your reply. My node version is 4.7.3 – Mars Zhu Feb 21 '17 at 17:35
  • You may want to research any module with binaries to make 100% sure the exact version you have loaded has tested compatibility with v4.7.3. Is there a specific reason you aren't using v6.x? – jfriend00 Feb 21 '17 at 17:43
  • Thank you for your advice. I've removed modules which compile to .node files (in my case they are memwatch-next, heapdump, segfault-handler). So far the app hasn't caused any segmentation fault. I'll keep it running for a few hours and come back with results. – Mars Zhu Feb 21 '17 at 17:55
  • I'll also upgrade my node to v6.x. The weird thing is that I've been running the same code and modules for the past month and it never threw any segmentation faults until today after a js code update... – Mars Zhu Feb 21 '17 at 17:57
  • Probably only want to change one thing at a time so you can figure out what caused it. So, if you've removed some modules, I'd stay on the same node version until you get your test results from that first change. Did your JS code update happen to update any modules to newer versions? – jfriend00 Feb 21 '17 at 17:58
  • @jfriend00 the server just received another SIGSEGV so I guess something else is causing it. The JS code update didn't update any modules except adding a new field in an existing mongoose schema but I doubt this is the reason. – Mars Zhu Feb 21 '17 at 18:02
  • Any chance of an incompatibility in Mongoose between your new schema and older data stores? – jfriend00 Feb 21 '17 at 18:10
  • Does adding an extra Boolean field to the schema raise incompatibility issues? I can save/read the extra field without any issue. – Mars Zhu Feb 21 '17 at 18:16
  • I don't know Mongoose specifically, but since that's what you changed, that would be my first suspect until you rule it out. Is there a way to start with a new database created with the new schema and run with that for awhile? – jfriend00 Feb 21 '17 at 18:17
  • Hmm, the server is live and receiving traffic. I've tried to create a separate db on my staging server, but I didn't receive the SIGSEGV signal with some fake traffic. I also suspect that the segmentation fault may be caused by garbage collection, because it is triggered after a different JS function every time, without a clear pattern. – Mars Zhu Feb 21 '17 at 18:22
  • I rather doubt the garbage collector all by itself is causing a segfault (that is heavily tested and used code). But, something else (native code) could have clobbered memory earlier which shows up as a segfault later. Or, a native code module could have done something wrong with memory management that causes a later issue in GC. Here's one question that discusses schema changes in Mongoose: http://stackoverflow.com/questions/7617002/dealing-with-schema-changes-in-mongoose – jfriend00 Feb 21 '17 at 18:26
  • @jfriend00 thank you for the post, I'll check it now. If a native code module is messing up memory management as you described, will a server restart (app, hardware) fix the issue? – Mars Zhu Feb 21 '17 at 18:31
  • It depends upon when the corruption is happening. If it's happening early in the life of the server (like near startup), then a restart likely won't help (it will just get corrupted right away after restarting) and the ticking time bomb will still be there waiting to go off. If it's only getting corrupted later after some build up of operations, then a restart may give it more good time to run. Hard to know. How long does it usually take to crash after restarting? – jfriend00 Feb 21 '17 at 18:34
  • About 10 mins. What about a hardware reset? – Mars Zhu Feb 21 '17 at 18:38
  • If you're not doing anything unusual hardware-related (just code and disk and network), then it would be rare that a full reboot would make a difference, but if you haven't done a reboot in a long time, it's worth doing one now. I wouldn't do one regularly unless you somehow prove it makes a difference. I've done some nodejs programming on a Raspberry Pi running Linux using some add-on hardware (temperature probes) and did occasionally find I needed to do a system reboot. But, now that the code is stable, I've only had to reboot the hardware once in two years. – jfriend00 Feb 21 '17 at 18:48
  • I'm giving it a try anyway since I'm that desperate :) Could a MongoDB error contribute to a Node.js segmentation fault? Also, is there anything I can do with those V8 stack traces? I've checked around but no one seems to be talking about them. – Mars Zhu Feb 21 '17 at 18:51
  • Did you ever try with the latest node v6.x to see if that makes any difference? – mscdex Feb 21 '17 at 21:17
  • I've just upgraded to 6.9.5 and the server hasn't crashed for 30 mins! I really think this could be the solution, but I need to test a bit more to confirm! – Mars Zhu Feb 22 '17 at 06:19
  • Thank you very much @jfriend00, can you please submit an answer and I'll mark it accepted once I confirm! – Mars Zhu Feb 22 '17 at 06:19

2 Answers


Since the stack trace is different every time and not very illuminating, all you can do is try things. The first suspects are things that use native code, because plain JavaScript is unlikely to cause a segfault on its own. More probably, native code is corrupting memory or interacting incorrectly with the garbage collector in node.js.

So, the thing to look at is the interaction between your current version of node.js and anything you use that relies on native code (such as MongoDB). Here are things to try:

  1. Identify all modules that use native code and temporarily remove any that you can live without.

  2. Upgrade both node.js and mongoDB to recent versions in case you have some interaction between their specific versions that is causing the problem. If you can't upgrade node.js to a recent stable version, then make absolutely sure that all the modules you are running are certified to be stable with the version of node.js that you do have.

  3. Restart your server just in case there's anything goofed up in the OS that is contributing to the problem.

  4. Start with a clean database or run some sort of database check on your database in order to verify that there is no corruption there.

  5. Whenever you update your DB schema, make sure you have a strategy for moving the prior database forward (it looks like in Mongoose you can just make sure you assign a default value to new schema fields).

  6. Gather new info after making changes and repeat the process, trying to only change one thing at a time so that if it fixes the issue you will know exactly which item it was that fixed it.
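
For step 1, compiled addons are the files that end in `.node`, as the comments above mention. A minimal sketch for listing them, assuming a Node.js version recent enough to support `fs.readdirSync` with `withFileTypes` (10.10 or later); the function name `findNativeAddons` is made up for illustration:

```javascript
const fs = require('fs');
const path = require('path');

// Recursively collect compiled addons (*.node files) under a directory.
// Symlinked directories are not followed, which avoids cycles.
function findNativeAddons(dir, found = []) {
  let entries;
  try {
    entries = fs.readdirSync(dir, { withFileTypes: true });
  } catch (err) {
    return found; // unreadable or missing directory: nothing to report
  }
  for (const entry of entries) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      findNativeAddons(full, found);
    } else if (entry.isFile() && entry.name.endsWith('.node')) {
      found.push(full);
    }
  }
  return found;
}
```

Calling `findNativeAddons(path.join(process.cwd(), 'node_modules'))` from the project root returns the paths of the addons, and the directory names show which packages they belong to.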

jfriend00
  • Thank you @jfriend00. In my case I've removed modules which compile to .node files (memwatch-next, heapdump, segfault-handler). Then upgraded my Node JS to v6.9.5 and the segmentation fault has gone! – Mars Zhu Feb 22 '17 at 06:51

Something like that can happen when you copy the code with node_modules that included binary modules compiled for a different architecture than the one you're trying to run it on.

Try either removing node_modules and running npm install from scratch, or you can try running npm rebuild without removing node_modules.
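
To judge whether such a mismatch is plausible, it can help to compare the runtime details on the machine where node_modules was built with those on the server. A small sketch (the function name `runtimeFingerprint` is made up for illustration; the `process` properties are standard Node.js):

```javascript
// Summarize the runtime details that a compiled addon has to match.
function runtimeFingerprint() {
  return {
    node: process.version,          // Node.js release in use
    abi: process.versions.modules,  // module ABI version that addons are compiled against
    v8: process.versions.v8,        // bundled V8 version
    arch: process.arch,             // CPU architecture, e.g. "x64"
    platform: process.platform,     // operating system, e.g. "linux"
  };
}

console.log(runtimeFingerprint());
```

If any of these values differ between the two machines, rebuilding the addons on the server is the safer choice.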

rsp
  • Thank you for your solution. It may help in other situations, but the application still crashes on my server after running a fresh npm install and npm rebuild. – Mars Zhu Feb 21 '17 at 17:01