
I need to be able to suspend a "running script", have the OS save its state to disk, and later resume it by reading that state and continuing exactly from where it left off. The system is a 12-core compute node with 48GB of shared memory, running Linux. I have no admin rights and I log in remotely using ssh. The scripts and the executables they run do not use a GUI, it's all command line, and as far as I know they don't explicitly require networking or sockets.

By "running script" (or "pipeline") I mean a bash script or a perl script or a combination of the two which spawn some C/C++ executables, possibly they are using openmp parallelisation. Or spawning in parallel executables using gnu-parallel. So, we are not talking about a single executable but a sequence of executables either running in parallel or in sequence, using implicit parallelisation over 12 cores with a common memory, glued by several unix commands (e.g. awk).

I need to suspend and restart the pipeline because the scheduler (MOAB) kills all jobs running longer than 24h (a system rule). The idea is to suspend a job and re-queue it. This technique is perfectly legitimate.

Modifying the executables' source code so that they all save their state and later resume it is not practical, as it would mean modifying several open-source executables to accept a 'save-state-and-suspend' signal (say, ImageMagick's 'convert', or even grep, sed, awk, and perl itself!). Plus, there is also one executable which is closed-source, no source code available.

So, I believe I am in a situation where one (the only?) practical option would be to run my script/pipeline in a so-called sandbox environment, e.g. QEMU (an emulator), which can hopefully be sent a signal to 'hibernate': save the state of all currently running programs within it by writing the whole memory and CPU state to disk (48GB is not a problem) and then suspend.

I am not an expert in any of the above, so pardon my terminology or if I say something invalid. I am only sketching.

To recap: I am asking anyone with experience for a solution for suspending and restarting complex script jobs under Linux without resorting to modifying code to 'save state'. The solution should also be relatively computationally efficient, i.e. not end up wasting a lot of supercomputer power on running the emulator.

If you believe that the QEMU solution described above is OK, then please, if you can, give an example of how to start with it, i.e. create an emulator Linux image from public ISOs, load the image, run the 'script', tell the emulator to 'suspend/hibernate' after 20h, and then resume the emulator by reading its state from the suspend file. All this, ideally, from the command line or via a script.
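For reference, here is a rough sketch of how the QEMU route could look, using the QEMU monitor's `savevm`/`loadvm` commands, which store the full RAM + CPU + device state as an internal snapshot inside a qcow2 disk image. This is only an outline under assumptions: `qemu-system-x86_64` and `socat` are available in my home directory, a guest image `pipeline.qcow2` has already been installed from a public ISO, and the memory figures are illustrative.

```shell
# One-time setup: create a disk image and install a guest from a
# public ISO (interactive install, done once):
#   qemu-img create -f qcow2 pipeline.qcow2 60G
#   qemu-system-x86_64 -m 4G -cdrom distro.iso -hda pipeline.qcow2

# Start the VM headless, exposing the QEMU monitor on a UNIX socket:
qemu-system-x86_64 -m 40G -smp 12 -hda pipeline.qcow2 -nographic \
    -monitor unix:/tmp/qemu-mon.sock,server,nowait &

# ... start the pipeline inside the guest (serial console or ssh) ...

# After ~20h, snapshot the whole VM state into the qcow2 image,
# then shut the emulator down:
echo savevm before_kill | socat - unix-connect:/tmp/qemu-mon.sock
echo quit              | socat - unix-connect:/tmp/qemu-mon.sock

# In the next 24h job slot, resume from exactly that point:
qemu-system-x86_64 -m 40G -smp 12 -hda pipeline.qcow2 -nographic \
    -monitor unix:/tmp/qemu-mon.sock,server,nowait -loadvm before_kill &
```

Note that `savevm` requires the disk image to be qcow2 (internal snapshots are not supported on raw images), and the snapshot time grows with guest RAM size.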

Any other solutions, as long as they are practical for the given setting, are welcome.

Please note: I have no admin rights, but I can install things in my home directory and have lots of hard disk space. Also, the programs do not use a GUI, it's all command line, and as far as I know they don't explicitly require networking or sockets.

A positive side-effect of the emulator solution would be that any such "pipeline" could be distributed to any OS (e.g. Mac or Windows) where the sandbox/emulator is implemented, without the complex process of recompiling everything and installing gnu-utils, bash, boost, etc. I find myself stuck in this situation many times.

thanks for your help, bliako.

bliako

1 Answer


I'm not sure which version of PBS you're using, but TORQUE offers integration with Berkeley Lab Checkpoint/Restart (BLCR). The most important requirement for BLCR is that all the nodes have exactly the same OS image. Setting it up is rather involved and is documented in the TORQUE docs.

Essentially, the pbs_mom daemons are configured to use BLCR, and whenever you stop a job the daemon uses BLCR to take a snapshot of the kernel's internal data structures, capturing the exact state of the process, which makes it possible to restart the same process from exactly the same point.
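Once the site is set up that way, the user-side workflow is short. A hedged sketch, assuming TORQUE was built with BLCR support and the blcr kernel modules are loaded on the compute nodes (`<jobid>` is a placeholder for your actual job id):

```shell
# Submit the pipeline as a checkpointable job; -c controls
# checkpointing (here: enabled, taken periodically every 60 min):
qsub -c enabled,periodic,interval=60 pipeline.sh

# Checkpoint and requeue a running job by hand:
qhold <jobid>     # pbs_mom invokes BLCR's cr_checkpoint, job is held
qrls <jobid>      # job restarts from the checkpoint via cr_restart
```

The checkpoint files land in the job's checkpoint directory on the mother superior node, so you can verify that a `qhold` actually produced checkpoint data there.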

dbeer
  • I have followed the example in the docs link you provided. I managed to have the scheduler create a directory for checkpoint data whenever I tell it to checkpoint (qhold jobID). But this dir is empty, no checkpoint data is written in there. Which makes me suspect that BLCR is not in the kernel or TORQUE was not compiled with BLCR (e.g. here: http://docs.adaptivecomputing.com/torque/4-2-6/help.htm#topics/2-jobs/introToBLCR.htm%3FTocPath%3D2.0%20Submitting%20and%20managing%20jobs%7C2.6%20Job%20checkpoint%20and%20restart%7C_____1). qsub --version gives 4.2.4.1 and lsmod|grep blcr shows nothing. – bliako Mar 04 '14 at 18:40
  • Did you configure with blcr at the time you built the pbs_mom daemons? – dbeer Mar 04 '14 at 20:58
  • I have no idea, I am just a lowly user there and the admins don't seem to have a clue about it. Is there a way to check this? Also, is there a way to check if the kernel was built with BLCR, or whether BLCR kernel modules exist and are loaded? Even if all this checks out OK, do you think this could work for the case I need it for, i.e. a bash script is submitted to the scheduler, the script then runs a perl script, which spawns several parallel or sequential processes (most use OpenMP though) glued together with shell commands (awk, sed) for results file processing. Will BLCR work in such a case, or just for one executable? – bliako Mar 05 '14 at 11:26
  • First off, if you aren't an admin and don't have an admin helping you, there's almost no chance this will work. If they don't know what BLCR is, then they haven't built TORQUE to use BLCR. On the bright side, it is possible to get your exact scenario to work, but it is a bit more difficult for jobs that are running across more than one host. – dbeer Mar 05 '14 at 17:43
  • Thanks for your help. I will go the QEMU way and see where that takes me. – bliako Mar 05 '14 at 19:52