
We have an application running on several thousand identical machines. Same OS, same hardware, same application installation. On very rare occasions, a machine locks up: Alt+Tab, Ctrl+Alt+Del, and the application are all unresponsive. When we inspect our application's log file afterwards, it ends with a series of null characters, the last data written before the crash.

I'm hoping to use this fact to debug the lockup. My guess is that the number of null characters equals the space allocated for my log statement, but the content itself was never written to disk. I'm also guessing that a disk I/O problem occurred, preventing the write and, of course, locking up the OS. I can't confirm any of this. So my question is: have you ever seen a condition like this, how did it occur, and how would you go about troubleshooting it?
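One way to test that guess is to measure the run of null bytes at the end of each affected log and compare it with the length of a typical log statement. The following is a minimal sketch, not part of our application; the function name `count_trailing_nulls` and the chunk size are made up for illustration.

```python
import os

def count_trailing_nulls(path, chunk_size=4096):
    """Return the number of consecutive 0x00 bytes at the end of the file."""
    count = 0
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        # Walk backwards through the file one chunk at a time.
        while pos > 0:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            chunk = f.read(step)
            stripped = chunk.rstrip(b"\x00")
            count += len(chunk) - len(stripped)
            if stripped:
                # Found a non-null byte, so the run of nulls ends here.
                break
    return count
```

If the count consistently matches the size of a single log record (or a multiple of the buffer size the logger uses), that would support the idea that space was allocated for the write but the contents never reached the disk.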

reuscam
  • Do you have a memory leak in the code, i.e., does the process keep getting bigger? – ColWhi Apr 18 '11 at 14:18
  • I'd say it is a hardware lockup (DMA or SATA); I don't really expect the nulls found on disk to be significant, but good for you to ask around – sehe Apr 18 '11 at 14:44
  • What you describe looks like kernel-mode deadlock. Your application might interact with some buggy driver. – Serge Dundich Apr 18 '11 at 14:58
  • Are they always "hangs", or do the machines crash with different kinds of errors? If it's always "hangs" you probably have badly designed hardware or buggy drivers. Might also be something overheating - I'd check GPU temperature first. If it's different kinds of crashes it's probably defective hardware. E.g. bad RAM can cause all kinds of errors. PS: good luck, stuff like that can be really nasty to track down! – Paul Groke Apr 19 '11 at 08:12
  • We have some proprietary drivers, so them being buggy is definitely a suspicion of ours. If only this were a BSOD with a .dmp file. Heat is not a problem, we have done extensive testing there. This is a very rare error, probably 1 event in some 3.2 million machine hours. – reuscam Apr 19 '11 at 12:17
  • @reuscam I am having an extremely similar issue with a client, can you shed some insight into how this was eventually handled? What was the outcome? – beeks Feb 10 '17 at 15:30
  • @beeks The accepted answer is what we assumed was going on. Turns out we had some sort of hardware problem, a SATA cable with a faulty ground, I think. We fixed that, and the frequency of failures went way down. You might be able to find some sort of option to flush immediately, rather than cache, but I'm not sure that's going to fix anything for you. Also, consult your Windows event logs for disk errors, and use the disk utility to see if your disk is tracking any. – reuscam Feb 11 '17 at 17:24

2 Answers


I've seen this type of thing happen, and I think you're looking in the right general direction.

When this happens, I assume you're able to pinpoint the exact hardware? After a failure, I'd recommend running a memtest (http://www.memtest.org/).

I've seen this sort of thing with power supplies, bad disk controllers, etc. You can go insane trying to track them down.

It seems like you're going about this the right way. See if you can find a way to force the problem to happen more quickly; when it happens, run the memtest and run chkdsk /R (and check the event log for controller errors while it runs).

Any chance you could get a kernel debugger attached?

Any chance %SystemRoot%\memory.dmp was produced?

stuck
  • Because of the age of the device and the environment, we have failures with nearly every component in the machine. (Un)fortunately, the machine recovers after a power cycle, so we can't pinpoint a failure. I will recommend memtest on the next case we see. We will also check for a memory.dmp, but I doubt it was written, as there is no BSOD. chkdsk /R is run automatically after bootup, so we will check the event log as well. Kernel debugger is not an option - this is in a production environment, and cannot reliably be reproduced in test. – reuscam Apr 19 '11 at 12:13

NTFS does not journal data (only metadata), so things like that can happen. At the time of the crash/hang, the metadata (file size, data block allocation) had already been committed, but the data (data block contents) had not. Unfortunately this is normal behavior with NTFS and will not give you any insight into the problem causing the hang.

So the answer is: a crash at the "right" time can cause this.

BTW: The same thing can of course happen with FAT/FAT32.
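If the goal is to lose less of the log when this happens, one option is to ask the OS to push each log line to the device before continuing, which narrows the window in which metadata is committed but data is not. The sketch below is only an illustration under that assumption; `append_durably` is a hypothetical helper, and it will not prevent or explain the hang itself.

```python
import os

def append_durably(path, message):
    """Append one line, then ask the OS to write it through to the device."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(message + "\n")
        f.flush()             # Python's buffer -> OS file cache
        os.fsync(f.fileno())  # OS file cache -> disk (uses _commit on Windows)
```

Calling `append_durably("app.log", "about to call driver")` before a suspect operation makes it more likely that the last line on disk reflects what the application was really doing. Syncing on every write is slow, and a drive with its own write cache may still hold data back, so this is a diagnostic aid rather than a guarantee.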

Paul Groke