First the background to this intriguing challenge. The continuous integration build can often have failures during development and testing of deadlocks, loops, or other issues that result in a never ending test. So all the mechanisms for notifying that a build has failed become useless.
The solution will be to have the build script timeout if there's zero output to the build log file for more than 5 minutes since the build routinely writes out the names of unit tests as it proceeds. So that's the best way to identify it's "frozen".
Okay. Now the nitty gritty...
The build server uses Hudson to run a simple bash script that invokes the more complex build script based on Nant and MSBuild (all on Windows).
So far all solutions around the net involve a timeout on the total run time of the command. But that solution fails in this case because the tests might hang or freeze in the first 5 minutes.
What we've thought of so far:
First, here's the high level bash command run the full test suite in Hudson.
build.sh clean free test
That command simply sends all the Nant and MSBuild build logging to stdout.
It's obvious that we need to tee that output to a file:
build.sh clean free test 2>&1 | tee build.out
Then in parallel a command needs to sleep, check the modify time of the file and if more than 5 minutes kill the main process. A kill -9
will be fine at that point--nothing graceful needed once it has frozen.
That's the part you can help with.
In fact, I made a script like this over 15 years ago to kill the connection with a data phone line to japan after periods of inactivity but can't remember how I did it.
Sincerely, Wayne