
I need to convert every file in a particular directory, then compile the results into a single computation, on a system using Slurm. The work on each individual file takes about as long as the rest of the collective calculations. Therefore, I would like the individual conversions to happen simultaneously. Sequentially, this is what I need to do:

main.sh

#!/bin/bash
#SBATCH --account=millironx
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4

find . -maxdepth 1 -name "*.input.txt" \
  -exec ./convert-files.sh {} \;

./compile-results.sh *.output.txt

./compute.sh

echo "All Done!"

convert-files.sh

#!/bin/bash
# Simulate a time-intensive process
INPUT="$1"
OUTPUT="${INPUT/input.txt/output.txt}"
sleep 10
date > "$OUTPUT"

While this system works, I generally process batches of 30+ files, and the total computation time exceeds the time limit set by the administrator while using only one node. How can I process the files in parallel, then compile and compute on them after they have all been completely processed?

What I've tried/considered

Adding srun to find -exec

find . -maxdepth 1 -name "*.input.txt" \
  -exec srun -n1 -N1 --exclusive ./convert-files.sh {} \;

find -exec waits for each command it launches to finish, and srun blocks until its job step completes, so time-wise this does exactly the same thing as the base code.
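
One variant of this idea that does overlap the conversions, sketched here under the same allocation header as main.sh, is to background each srun job step and wait for all of them before compiling; only the loop, the trailing &, and the wait are additions to what the question already shows:

#!/bin/bash
#SBATCH --account=millironx
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4

# Launch one job step per input file in the background
for FILE in *.input.txt; do
  srun -n1 -N1 --exclusive ./convert-files.sh "$FILE" &
done

# Block until every backgrounded srun step has finished
wait

./compile-results.sh *.output.txt
./compute.sh

echo "All Done!"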

Using sbatch in the submission script

find . -maxdepth 1 -name "*.input.txt" \
  -exec sbatch ./convert-files.sh {} \;

This does not wait for the conversions to finish before starting the computations, and they consequently fail.

Using GNU parallel

find . -maxdepth 1 -name "*.input.txt" | \
  parallel ./convert-files.sh

OR

find . -maxdepth 1 -name "*.input.txt" | \
  parallel srun -n1 -N1 --exclusive ./convert-files.sh

parallel can only "see" the number of CPUs on the current node, so it only processes four files at a time. Better, but still not what I'm looking for.
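
One adjustment worth sketching here, assuming GNU parallel is available inside the batch script and $SLURM_NTASKS is set by the allocation: tell parallel explicitly how many jobs to keep running with -j instead of letting it default to the local CPU count, and let srun place each step on the allocated nodes:

find . -maxdepth 1 -name "*.input.txt" | \
  parallel -j "$SLURM_NTASKS" srun -n1 -N1 --exclusive ./convert-files.sh {}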

Using job arrays

This method sounds promising, but I can't figure out a way to make it work since the files I'm processing don't have a sequential number in their names.
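
One common way around the numbering problem, sketched here with a hypothetical wrapper script convert-array.sh and array bounds assumed to match the file count (e.g. sbatch --array=0-29 convert-array.sh for 30 files), is to let each array task pick its file by index from a sorted glob:

#!/bin/bash
#SBATCH --account=millironx
#SBATCH --time=05:00:00
#SBATCH --cpus-per-task=4

# convert-array.sh (hypothetical): every array task sees the same
# lexicographically sorted glob, so indexing by the task ID gives
# each task a distinct file even though the names are not numbered
FILES=(*.input.txt)
./convert-files.sh "${FILES[$SLURM_ARRAY_TASK_ID]}"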

Submitting jobs separately using sbatch

At the terminal:

$ find . -maxdepth 1 -name "*.input.txt" \
>  -exec sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>  ./convert-files.sh {} \;

Five hours later:

$ srun --account=millironx --time=30:00 --cpus-per-task=4 \
>   ./compile-results.sh *.output.txt & \
>   sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>   ./compute.sh

This is the best strategy I've come up with so far, but it means I have to remember to check on the progress of the conversion batches and initiate the computation once they are complete.

Using sbatch with a dependency

At the terminal:

$ find . -maxdepth 1 -name "*.input.txt" \
>  -exec sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>  ./convert-files.sh {} \;
Submitted job xxxx01
Submitted job xxxx02
...
Submitted job xxxx45
$ sbatch --account=millironx --time=30:00 --cpus-per-task=4 \
>   --dependency=after:xxxx45 --job-name=compile_results \
>   ./compile-results.sh *.output.txt & \
>   sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
>   --dependency=after:compile_results \
>   ./compute.sh

I haven't dared to try this yet, since I know that the last job is not guaranteed to be the last to finish.
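
The ordering concern goes away if the compile job depends on every conversion job rather than on the one submitted last. A sketch, assuming sbatch --parsable is available to print bare job IDs; the account, time, and CPU flags are the ones from the question, and --wrap is used so the *.output.txt glob expands when the compile job runs rather than at submission time:

JOBIDS=""
for FILE in *.input.txt; do
  # --parsable makes sbatch print only the job ID
  JOBID=$(sbatch --parsable --account=millironx --time=05:00:00 \
    --cpus-per-task=4 ./convert-files.sh "$FILE")
  JOBIDS="$JOBIDS:$JOBID"
done

# Compile only after every conversion job finishes successfully
COMPILE_ID=$(sbatch --parsable --account=millironx --time=30:00 \
  --cpus-per-task=4 --dependency=afterok$JOBIDS \
  --wrap "./compile-results.sh *.output.txt")

# Compute only after the compile job finishes successfully
sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
  --dependency=afterok:$COMPILE_ID ./compute.sh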


This seems like it should be such an easy thing to do, but I haven't figured it out yet.

Milliron X
2 Answers


If your $SLURM_NODELIST contains something similar to node1,node2,node34, then this might work:

find ... | parallel -S $SLURM_NODELIST convert_files
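
A usage sketch extending this, assuming the allocated nodes accept SSH logins from one another and the scripts sit on a shared filesystem: Slurm often reports the node list in compressed form such as node[01-04], which scontrol show hostnames can expand into the comma-separated list that -S expects:

NODES=$(scontrol show hostnames "$SLURM_NODELIST" | paste -sd, -)
find . -maxdepth 1 -name "*.input.txt" | \
  parallel -S "$NODES" ./convert-files.sh {}
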
Ole Tange

The find . -maxdepth 1 -name "*.input.txt" | parallel srun -n1 -N1 --exclusive ./convert-files.sh way is probably the one to follow. But it seems ./convert-files.sh expects the filename as an argument, and you are trying to push it to stdin through the pipe. You need to use xargs, and since xargs can work in parallel, you do not need the parallel command.

Try:

find . -maxdepth 1 -name "*.input.txt" | xargs -L1 -P$SLURM_NTASKS srun -n1 -N1 --exclusive ./convert-files.sh

-L1 will split the result of find per line and feed it to convert-files.sh, spawning a maximum of $SLURM_NTASKS processes at a time, and 'sending' each of them to a CPU on the nodes allocated by Slurm thanks to srun -n1 -N1 --exclusive.
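
Dropped into the submission script from the question, a sketch of the whole pipeline would then be: because xargs does not return until every srun step it spawned has exited, the compile and compute steps only start once all conversions are done:

#!/bin/bash
#SBATCH --account=millironx
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4

# Convert all input files in parallel across the allocated nodes
find . -maxdepth 1 -name "*.input.txt" | \
  xargs -L1 -P"$SLURM_NTASKS" srun -n1 -N1 --exclusive ./convert-files.sh

# xargs has waited for every conversion, so it is safe to continue
./compile-results.sh *.output.txt
./compute.sh

echo "All Done!"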

damienfrancois