
I want to run a program that creates a checkpoint file, and then run several variant configurations that all start from that checkpoint.

For example, if I run:

sbatch -n 1 -t 12:00:00 --mem=16g program.sh

And program.sh looks like this:

#!/bin/sh

# Run the program once to produce the checkpoint file
./set_checkpoint

# Submit one job per variant configuration; each restores from the checkpoint
sbatch -n 1 -t 12:00:00 --mem=16g cpt_restore_config1.sh
sbatch -n 1 -t 12:00:00 --mem=16g cpt_restore_config2.sh
sbatch -n 1 -t 12:00:00 --mem=16g cpt_restore_config3.sh
sbatch -n 1 -t 12:00:00 --mem=16g cpt_restore_config4.sh

Will this achieve the desired effect?

Sam Thomas
  • I would be interested to know if you actually managed to run this. I tried with a similar script and it didn't work. I don't think it's possible to just sbatch this type of script as suggested in the answer below. – Michele Pellegrino Mar 30 '23 at 12:59
  • @MichelePellegrino Yep! I use it all the time in my work. What's the error that you have? – Sam Thomas Apr 02 '23 at 03:47
  • Well, the error is simply that calling the script with `sbatch program.sh` literally yields no output. No job is started and I don't even get any output in the standard error. – Michele Pellegrino Apr 03 '23 at 07:35
  • Is this dependent on the contents of program.sh? Does it work without making the recursive call (i.e., just stating `echo Hello world`)? I'm wondering if there is a small error going on here, because this behavior seems odd and inconsistent with what I have experienced. – Sam Thomas Apr 10 '23 at 17:12
  • I confirm: tried to run `echo hello-world > test-$SLURM_JOB_PARTITION.txt` and submit to three different partitions using a script similar to yours. Nothing in the slurm output and error, no test output file created... – Michele Pellegrino Apr 12 '23 at 07:02

2 Answers


In general this is not needed. You can allocate all the resources you need in the main job script and then hand a share of them to each specific task with srun. Here is a basic example.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00

module load some_module
srun -n 4 -c 2 ./my_program arg1 arg2
srun -n 4 -c 2 ./my_other_program arg1 arg2

Note that we allocated 8 tasks with 2 CPUs each and gave 4 of those tasks to each srun step. Written like this, the two srun steps run one after the other. To run them in parallel, you can use this trick:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00

# The trailing & launches each step in the background so they run concurrently
srun -n 4 -c 2 ./my_program arg1 arg2 &
srun -n 4 -c 2 ./my_other_program arg1 arg2 &

# Block until both background steps have finished
wait

Just keep in mind that this might not work in some cases. I would suggest adding some logging and redirecting each step's stdout and stderr to its own file.
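As an illustration only, a minimal sketch of what that redirection could look like, reusing the job layout from above (the log file names are placeholders):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00

# Give each step its own log files so failures are easy to trace
srun -n 4 -c 2 ./my_program arg1 arg2 > my_program.out 2> my_program.err &
srun -n 4 -c 2 ./my_other_program arg1 arg2 > my_other_program.out 2> my_other_program.err &

wait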

Alternatively, if your tasks all run a single script with different sets of parameters, I suggest using argument parsing. In Python, I generally use Hydra's joblib launcher; it gives you parallelism out of the box.
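Sticking to plain shell rather than Python, the same single-script idea could look roughly like this; cpt_restore.sh and the config names are hypothetical stand-ins for one restore script that reads its parameters from the command line:

#!/bin/sh

# Submit the same restore script once per parameter set;
# cpt_restore.sh would read the configuration name from $1
for config in config1 config2 config3 config4; do
    sbatch -n 1 -t 12:00:00 --mem=16g cpt_restore.sh "$config"
done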

Prakhar Sharma

Based on the comments, it seems that sbatch is not guaranteed to work recursively for some reason. I've recently encountered similar issues, and I was able to get around the limitation by running my main script from the same shell I launch it from. In your case, this would mean using source program.sh instead of sbatch program.sh.
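For clarity, a minimal sketch of what that change amounts to, using the scripts from the question; nothing inside program.sh needs to change, but note that set_checkpoint then runs on the submission node rather than inside a batch job:

# instead of:
#   sbatch -n 1 -t 12:00:00 --mem=16g program.sh
# run the driver in the current shell:
source program.sh
# program.sh still creates the checkpoint and then submits the
# cpt_restore_config*.sh jobs with sbatch as before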