I need to convert every file in a particular directory, then compile the results into a single computation, on a system that uses Slurm. The work on each individual file takes about as long as the rest of the collective calculations, so I would like the individual conversions to happen simultaneously. Sequentially, this is what I need to do:
main.sh
#!/bin/bash
#SBATCH --account=millironx
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4
find . -maxdepth 1 -name "*.input.txt" \
-exec ./convert-files.sh {} \;
./compile-results.sh *.output.txt
./compute.sh
echo "All Done!"
convert-files.sh
#!/bin/bash
# Simulate a time-intensive process
INPUT="$1"
OUTPUT="${INPUT/input.txt/output.txt}"
sleep 10
date > "$OUTPUT"
While this system works, I generally process batches of 30+ files, and the computational time exceeds the time limit set by the administrator while only using one node. How can I process the files in parallel, then compile and compute on them once they have all been completely processed?
What I've tried/considered
Adding srun to find -exec
find . -maxdepth 1 -name "*.input.txt" \
-exec srun -n1 -N1 --exclusive ./convert-files.sh {} \;
find -exec waits for blocking processes, and srun is blocking, so this does exactly the same thing as the base code, time-wise.
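A variant I've considered but not tested: keep everything inside main.sh, but background each srun job step and wait for all of them before compiling. This is only a sketch, assuming the --exclusive job steps divide up the allocation the way I think they do:
# Inside main.sh, replacing the find -exec line:
for INPUT in *.input.txt; do
    srun -n1 -N1 --exclusive ./convert-files.sh "$INPUT" &
done
wait    # block until every backgrounded job step has finished

./compile-results.sh *.output.txt
./compute.sh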
Using sbatch in the submission script
find . -maxdepth 1 -name "*.input.txt" \
-exec sbatch ./convert-files.sh {} \;
This does not wait for the conversions to finish before starting the computations, and they consequently fail.
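The only workaround I can think of here is sbatch's --wait flag, which (per the man page) blocks until the submitted job terminates; backgrounding each submission and then waiting on all of them might give the ordering I need. A sketch, assuming the cluster's Slurm version is new enough to support --wait:
# Each sbatch call blocks until its job finishes, but the trailing &
# lets all of the submissions run concurrently
for INPUT in *.input.txt; do
    sbatch --wait --account=millironx --time=05:00:00 \
        --cpus-per-task=4 ./convert-files.sh "$INPUT" &
done
wait    # every conversion job has completed past this point
./compile-results.sh *.output.txt
./compute.sh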
Using GNU parallel
find . -maxdepth 1 -name "*.input.txt" | \
parallel ./convert-files.sh
OR
find . -maxdepth 1 -name "*.input.txt" | \
parallel srun -n1 -N1 --exclusive ./convert-files.sh
parallel can only "see" the number of CPUs on the current node, so it only processes four files at a time. Better, but still not what I'm looking for.
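I suspect the default of one job slot per local CPU core is what limits it to four at a time; forcing the slot count to match the allocation might help, though I haven't confirmed whether the resulting srun steps actually spread across nodes:
find . -maxdepth 1 -name "*.input.txt" | \
    parallel --jobs "$SLURM_NTASKS" \
    srun -n1 -N1 --exclusive ./convert-files.sh {}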
Using job arrays
This method sounds promising, but I can't figure out a way to make it work since the files I'm processing don't have a sequential number in their names.
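The closest workaround I can imagine is writing the file names to a list and indexing into it with the array task ID. A sketch, untested, where filelist.txt and convert-array.sh are hypothetical names of my own:
# At the terminal: build the list of inputs, then submit one array task per line
find . -maxdepth 1 -name "*.input.txt" > filelist.txt
sbatch --array=1-$(wc -l < filelist.txt) ./convert-array.sh
convert-array.sh
#!/bin/bash
#SBATCH --account=millironx
#SBATCH --time=05:00:00
#SBATCH --cpus-per-task=4
# Pick the input file that corresponds to this array index
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" filelist.txt)
./convert-files.sh "$INPUT"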
Submitting jobs separately using sbatch
At the terminal:
$ find . -maxdepth 1 -name "*.input.txt" \
> -exec sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
> ./convert-files.sh {} \;
Five hours later:
$ srun --account=millironx --time=30:00 --cpus-per-task=4 \
> ./compile-results.sh *.output.txt && \
> sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
> ./compute.sh
This is the best strategy I've come up with so far, but it means I have to remember to check on the progress of the conversion batches and initiate the computation once they are complete.
Using sbatch with a dependency
At the terminal:
$ find . -maxdepth 1 -name "*.input.txt" \
> -exec sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
> ./convert-files.sh {} \;
Submitted batch job xxxx01
Submitted batch job xxxx02
...
Submitted batch job xxxx45
$ sbatch --account=millironx --time=30:00 --cpus-per-task=4 \
> --dependency=after:xxxx45 --job-name=compile_results \
> ./compile-results.sh *.output.txt & \
> sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
> --dependency=after:compile_results \
> ./compute.sh
I haven't dared to try this yet, since I know that the last job is not guaranteed to be the last to finish.
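What seems safer to me (still untried) is to capture every conversion job's ID as it is submitted and make the compile job depend on all of them with afterok, which, per the sbatch man page, takes job IDs and waits for them to complete successfully. A sketch:
# Collect the job ID of every conversion job as it is submitted
JOBIDS=""
for INPUT in *.input.txt; do
    JOBID=$(sbatch --parsable --account=millironx --time=05:00:00 \
        --cpus-per-task=4 ./convert-files.sh "$INPUT")
    JOBIDS="${JOBIDS}:${JOBID}"
done

# Compile only after every conversion has finished successfully,
# then compute only after the compile job has finished successfully
COMPILEID=$(sbatch --parsable --account=millironx --time=30:00 \
    --cpus-per-task=4 --dependency="afterok${JOBIDS}" \
    ./compile-results.sh *.output.txt)
sbatch --account=millironx --time=05:00:00 --cpus-per-task=4 \
    --dependency="afterok:${COMPILEID}" ./compute.sh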
This seems like it should be such an easy thing to do, but I haven't figured it out yet.