We have an application running on several thousand identical machines: same OS, same hardware, same application installation. On very rare occasions, one of the machines locks up. Alt-Tab, Ctrl-Alt-Del, and the application itself are all unresponsive. When we inspect our application's log file afterwards, we find a series of null characters written at the end, as the last data before the crash.
I'm hoping to use this fact as a means to debug the lockup. My guess is that the number of null characters written is equal to the space that was allocated for my log statement, but the content itself was never actually written to disk. I'm also guessing that a disk I/O problem occurred, preventing the write and, of course, causing the OS lockup. I can't confirm any of this. So I guess my question is: have you ever seen a condition like this, how did it occur, and how might you go about troubleshooting it?
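
One way I could test the first guess is to count the trailing null bytes in each affected log and compare that count with the length of the log statement I would expect to come next. Below is a minimal sketch in Python. The function name `count_trailing_nulls` and the `chunk_size` parameter are my own placeholders, not part of our application, and it assumes the log can be opened as an ordinary binary file.

```python
import os


def count_trailing_nulls(path, chunk_size=4096):
    """Return the number of consecutive NUL bytes at the end of the file.

    Hypothetical helper for illustration: reads the file backwards in
    chunks so that large logs do not have to be loaded into memory.
    """
    count = 0
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        while pos > 0:
            read_size = min(chunk_size, pos)
            pos -= read_size
            f.seek(pos)
            chunk = f.read(read_size)
            # Strip NUL bytes from the right; the difference is the number
            # of trailing NULs found in this chunk.
            stripped = chunk.rstrip(b"\x00")
            count += len(chunk) - len(stripped)
            if stripped:
                # Hit real data, so the run of NULs ends here.
                break
    return count
```

If the count consistently matches the byte length of a plausible next log line (including its line terminator), that would support the idea that space was reserved for the statement but its content never reached the disk. If the counts look unrelated to any log line length, the nulls probably have a different origin.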