
I have a long-running C program that calculates the time evolution of particles. It takes 2 to 3 days to complete, even when parallelized with OpenMP. Sometimes I have to shut down the workstation, which forces me to stop a run mid-execution and lose all the data and time that went into it.

Is there a way to stop the run mid-execution, save all the necessary state, switch off the PC, and resume the next day from the point where it stopped? Maybe I can save some kind of "snapshot" and load it the next day to continue from that point. I am using Linux.

The structure of my C program (xyz.c) looks something like this:

/* necessary header files listed here */
int main (void)
{
   long i = 0;

   function1();
   ...
   ... /* (more functions in between) */
   ...
   function2();

   while (i < 10000000)
   {
        function1();
        function2();
        ....
        ....
        /* (several more functions) */
        ....
        ....
        function3();

        i = i+1;
   }
}

Most of the functions share numerous global variables. Some functions perform calculations on global as well as local variables, and some print the values of variables to an output file. When the run resumes, the printing should continue from where it stopped.

Any help will be very much appreciated.

I have not tried anything yet, as I have no idea how to stop a run and resume it from the same point. I read in some thread that I can use Ctrl+Z to suspend the process and the fg shell command to resume it, but I am not sure how that works or how it will affect the stream of data being worked on.

Edit: What I have in mind is that when I stop the program mid-execution, it asks whether I want to create a checkpoint; if I answer "y", it creates a checkpoint and exits. Alternatively, the program could create a checkpoint every 50000 iterations, so that after a sudden termination the last saved checkpoint remains. When run a second time, the program asks whether to start fresh or resume from the last saved checkpoint.
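
The scheme described in this edit could be sketched in C roughly as follows. This is only a minimal sketch under several assumptions, not the actual program: `SimState`, `step`, `run`, `save_checkpoint`, `load_checkpoint` and the checkpoint path are hypothetical names, the whole state is assumed to fit in one struct with no pointers in it, and the placeholder `step` stands in for the real function1() ... function3() calls. Instead of prompting for "y", it saves on Ctrl+C (SIGINT) and also every `every` iterations.

```c
#include <signal.h>
#include <stdio.h>

/* Hypothetical state: stands in for the program's real global variables. */
typedef struct {
    long   i;     /* next loop iteration to execute */
    double x[4];  /* placeholder for the particle data */
} SimState;

/* Set by the signal handler; polled by the main loop. */
volatile sig_atomic_t stop_requested = 0;

void on_sigint(int sig) { (void)sig; stop_requested = 1; }

/* Write to a temporary file first, then rename it over the old checkpoint.
   On Linux/POSIX the rename replaces the target atomically, so an interrupted
   write cannot destroy the previous good checkpoint. Returns 0 on success. */
int save_checkpoint(const char *path, const SimState *s)
{
    char tmp[512];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || n >= (int)sizeof tmp) return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    int ok = fwrite(s, sizeof *s, 1, f) == 1;
    if (fclose(f) != 0) ok = 0;
    if (!ok) { remove(tmp); return -1; }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Fills *s and returns 0 if a complete checkpoint exists; otherwise
   leaves *s untouched and returns -1. */
int load_checkpoint(const char *path, SimState *s)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    SimState tmp;
    int ok = fread(&tmp, sizeof tmp, 1, f) == 1;
    fclose(f);
    if (!ok) return -1;
    *s = tmp;
    return 0;
}

/* Placeholder for one time step of the real simulation. */
void step(SimState *s)
{
    for (int k = 0; k < 4; k++) s->x[k] += 0.5 * (k + 1);
}

/* Resumes from the checkpoint at `path` if there is one, otherwise starts
   from zero. Runs until max_iter or until Ctrl+C, saving every `every`
   iterations and once more before returning. Returns the iteration reached. */
long run(const char *path, long max_iter, long every)
{
    SimState s = {0};
    (void)load_checkpoint(path, &s);  /* s stays zeroed on a fresh start */
    signal(SIGINT, on_sigint);

    while (s.i < max_iter && !stop_requested) {
        step(&s);
        s.i++;
        if (s.i % every == 0) save_checkpoint(path, &s);
    }
    save_checkpoint(path, &s);
    return s.i;
}
```

A `main` would then call something like `run("xyz.ckpt", 10000000, 50000)`; deleting the checkpoint file gives a fresh start. A raw struct dump like this can generally only be read back by the same build on the same machine. With OpenMP, the save would have to happen between iterations, outside any parallel region. For the output file, opening it in append mode on resume is one option, though lines printed after the last checkpoint would then appear twice unless the file offset is stored in the checkpoint as well.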

  • Can't you just put the PC into hibernation? – Quimby Jun 14 '23 at 06:10
  • You can put entire PC to hibernation while your program is running. After resume it will continue to run. – dimich Jun 14 '23 at 06:10
  • We generally leave the PC running for several days for multiple runs, but sometimes a shutdown is unavoidable: when the battery of the UPS is about to die, when the lab will be closed for several days, and in several other scenarios. – Vishal Prajapati Jun 14 '23 at 06:20
  • The general technique you are looking for is called [checkpointing](https://en.m.wikipedia.org/wiki/Application_checkpointing). Effectively saving the state of the program to a file so you can restore later. You can do it completely manually, use [libraries](https://criu.org/Main_Page), or you can use OS facilities to help. – Botje Jun 14 '23 at 06:20
  • @Botje can you please elaborate? I mean, how can that be done on Linux? – Vishal Prajapati Jun 14 '23 at 06:23
  • How do you build the program? Is this yet another "I compile with optimizations off why is the program so slow" question...? Apart from obvious stuff like that, you need to show the actual OpenMP implementation, thread callbacks and variable declarations. How you access and allocate data matters a lot. – Lundin Jun 14 '23 at 06:41
  • @Lundin I have not tried anything in this regard. I am asking how I could proceed to get what I want. Leaving OpenMP aside, knowing what can be done in serial code to create such a checkpoint would be very helpful. – Vishal Prajapati Jun 14 '23 at 06:50
  • What do you mean "have not tried anything", are you compiling with optimizations off then? If so that's the problem, done. – Lundin Jun 14 '23 at 06:51
  • The code is optimised. I was asking for a way to stop and start a C program, with a small catch. – Vishal Prajapati Jun 14 '23 at 06:56
  • I edited my comment with links. One goes to the CRIU project which can checkpoint unmodified programs – Botje Jun 14 '23 at 07:02
  • Perhaps check https://unix.stackexchange.com/questions/43854/save-entire-process-for-continuation-after-reboot – Support Ukraine Jun 14 '23 at 07:02
  • Docker offers an experimental feature, [checkpoint](https://docs.docker.com/engine/reference/commandline/checkpoint/), that may help you if you can run your program in Docker (see @Botje's comment). – Mathieu Jun 14 '23 at 07:03
  • Searching for "linux save program state for later resume site:unix.stackexchange.com" gives a lot of related links – Support Ukraine Jun 14 '23 at 07:03
  • @Mathieu guess what it uses under the hood ;-) – Botje Jun 14 '23 at 07:04
  • Checkpointing a long running program is often a good idea even if you think it will not need to be interrupted, though coding and testing that the checkpointing works is non-trivial, especially when it is implemented as an afterthought. To ensure that floats etc get loaded back exactly as they were, you can use my float to and from hex functions here https://stackoverflow.com/questions/76270995/in-c-how-do-you-print-a-float-double-as-a-string-and-read-it-back-as-the-same-f/76272340#76272340 – Simon Goater Jun 14 '23 at 09:50
  • Does this answer your question? [How to "hibernate" a process in Linux by storing its memory to disk and restoring it later?](https://stackoverflow.com/questions/2134771/how-to-hibernate-a-process-in-linux-by-storing-its-memory-to-disk-and-restorin) – phuclv Jun 17 '23 at 10:11
  • duplicates: [Save a process' memory for later use?](https://stackoverflow.com/q/712876/995714), [How to "hibernate" a process in Linux by storing its memory to disk and restoring it later?](https://stackoverflow.com/q/2134771/995714), [Suspend/resume single process to/from disk](https://unix.stackexchange.com/q/23078/44425), [How can I hibernate a running application?](https://askubuntu.com/q/758350/253474), [How do you "hibernate" a process in Linux](https://www.quora.com/How-do-you-hibernate-a-process-in-Linux-by-storing-its-memory-to-a-disk-and-restoring-it-later) – phuclv Jun 17 '23 at 10:13
  • [Save entire process for continuation after reboot](https://unix.stackexchange.com/q/43854/44425). Related: [Can I "hibernate" a program?](https://superuser.com/q/275010/241386) – phuclv Jun 17 '23 at 10:13

1 Answer


This kind of thing can be done with process checkpointing, which is a fairly esoteric technique. Implementations of the concept have appeared over the years in projects like Mosix, but never caught on in the mainstream.

In GNU Emacs there is code to save the process state. You may be able to integrate that into your program to do checkpointing. Beware the GPL license, of course.

You can also (effectively) do this using a virtual machine. Run your program in a dedicated VirtualBox instance or whatever. You can close that instance and save its state at any time, and later restart it. In between, you can power off the host computer, and even migrate the VM, saved state and all, to another machine.

Kaz