I have a long C program that calculates the time evolution of particles. It takes 2 to 3 days to complete, even when parallelized with OpenMP. Sometimes I have to shut down the workstation, which means stopping the run mid-execution and losing all the data and time that went into it.
Is there some way to stop the run mid-execution, save everything needed to continue, switch off the PC, come back the next day, and start the run from the point where it was stopped? Maybe I can save some "snapshot" and load it the next day to continue from the same point. I am using Linux.
The structure of my C program (xyz.c) looks something like this:
    /* necessary header files listed here */
    int main(void)
    {
        function1();
        ...
        ...   /* (more functions in between) */
        ...
        function2();
        while (i < 10000000)
        {
            function1();
            function2();
            ....
            ....
            /* (several more functions) */
            ....
            ....
            function3();
            i = i + 1;
        }
    }
Most of the functions share numerous global variables. Some functions perform calculations on global as well as local variables, and some print values of the variables to an output file. After resuming, the printing should continue in the output file from where it stopped.
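To make the "snapshot" idea concrete, here is a minimal sketch of saving and restoring the shared state. It assumes the globals can be gathered into one plain struct without pointers. The names `sim_state`, `N_PARTICLES`, `save_checkpoint`, and `load_checkpoint` are hypothetical and only stand in for whatever the real program uses.

```c
#include <assert.h>
#include <stdio.h>

#define N_PARTICLES 4   /* hypothetical size, for illustration only */

/* Hypothetical stand-in for the program's globals, gathered in one place. */
struct sim_state {
    long   iteration;           /* value of the loop counter i */
    long   out_offset;          /* ftell() of the output file at checkpoint time */
    double pos[N_PARTICLES];
    double vel[N_PARTICLES];
};

/* Write the state to "<path>.tmp", then rename it over <path>, so that a
   kill in the middle of a write leaves the previous checkpoint intact.
   Returns 0 on success, -1 on failure. */
int save_checkpoint(const char *path, const struct sim_state *s)
{
    char tmp[512];
    int len;
    FILE *f;
    size_t written;

    assert(path != NULL && s != NULL);
    len = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (len < 0 || (size_t)len >= sizeof tmp)
        return -1;

    f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;
    written = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || written != 1) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Read a state previously written by save_checkpoint().
   Returns 0 on success, -1 if the file is missing or too short. */
int load_checkpoint(const char *path, struct sim_state *s)
{
    FILE *f;
    size_t got;

    assert(path != NULL && s != NULL);
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    got = fread(s, sizeof *s, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}
```

The `out_offset` field is one possible way to handle the output file: store the result of `ftell()` (after `fflush()`) when the checkpoint is taken, and on resume cut the output file back to that length before appending. Dynamically allocated arrays would have to be written element by element, since a raw pointer value is meaningless in a new process.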
Any help will be very much appreciated.
I have not tried anything yet, as I have no idea how to stop a run and resume it from the same point. I read in some thread that I can use Ctrl+Z to stop the process and the fg shell command to resume it, but I am not sure how that works or how it will affect the stream of data being worked on.
Edit: What I have in mind is that when I stop the program mid-execution, it asks whether I want to create a checkpoint; if I answer "y", it creates a checkpoint and exits. Alternatively, the program could create a checkpoint every 50000 iterations, so that after a sudden termination the last saved checkpoint remains. When run a second time, the program asks whether to start fresh or resume from the last saved checkpoint.
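A rough sketch of that control flow, under simplifying assumptions: the whole state is reduced to a hypothetical `struct state`, the body of the loop is replaced by a one-line stand-in, and the interactive prompts are left out. Ctrl+C (SIGINT) only sets a flag, the loop saves a checkpoint and stops at the end of the current iteration, and a run resumes automatically whenever a checkpoint file exists (deleting the file gives a fresh start). All names here (`run`, `save_state`, `load_state`, `CHECKPOINT_EVERY`) are made up for illustration.

```c
#include <signal.h>
#include <stdio.h>

#define CHECKPOINT_EVERY 50000L

/* Set by the signal handler, polled by the main loop. */
static volatile sig_atomic_t stop_requested = 0;

static void on_sigint(int sig)
{
    (void)sig;
    stop_requested = 1;   /* only set a flag; do the real work in the loop */
}

/* Hypothetical minimal state: loop counter plus one accumulated value. */
struct state {
    long   i;
    double x;
};

/* Write to a temporary file first, then rename it over the old checkpoint. */
static int save_state(const char *path, const struct state *s)
{
    char tmp[512];
    int len = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    FILE *f;
    size_t written;

    if (len < 0 || (size_t)len >= sizeof tmp)
        return -1;
    f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;
    written = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || written != 1) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

static int load_state(const char *path, struct state *s)
{
    FILE *f = fopen(path, "rb");
    size_t got;

    if (f == NULL)
        return -1;
    got = fread(s, sizeof *s, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}

/* Run the loop until `end` iterations are done or Ctrl+C is pressed.
   Returns the iteration count that was reached. */
long run(const char *ckpt_path, long end)
{
    struct state s = {0, 0.0};

    if (load_state(ckpt_path, &s) == 0)
        fprintf(stderr, "resuming from iteration %ld\n", s.i);

    signal(SIGINT, on_sigint);

    while (s.i < end) {
        s.x += 1.0;          /* stand-in for function1() ... function3() */
        s.i = s.i + 1;

        /* Checkpoint between iterations, outside any parallel region. */
        if (s.i % CHECKPOINT_EVERY == 0 || stop_requested) {
            if (save_state(ckpt_path, &s) != 0)
                fprintf(stderr, "warning: checkpoint failed\n");
            if (stop_requested)
                break;
        }
    }
    return s.i;
}
```

Calling something like `run("xyz.ckpt", 10000000)` from `main` would then either start at iteration 0 or continue from the last saved iteration. With periodic checkpoints alone, at most 50000 iterations of work are lost after a sudden termination.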