
I'm running a shell script, script.sh, in parallel; each of its lines goes into a folder and runs a Fortran code:

cd folder1 && ./code & 
cd folder2 && ./code &
cd folder3 && ./code &
cd folder4 && ./code &
...
cd folder96 && ./code &
wait
cd folder97 && ./code &
...
cd folder2500 && ./code

There are around 2500 folders, and the code outputs are independent of each other. I have access to 96 CPUs, and each job uses around 1% of the total CPU, so I run 96 jobs in parallel using & and the wait command. For various reasons, not all 96 jobs finish at the same time: some take 40 minutes, some 90 minutes, an important difference. So I was wondering whether the jobs that finish earlier could free their CPUs for new jobs, in order to optimize the total execution time.

I also tried GNU Parallel:

parallel -a script.sh

but it had the same issue, and I could not find anyone on the internet with a similar problem.

  • You have not explained what "the issue" is. Why can't you start all of the jobs and let the OS do its job of scheduling them? – Scott Hunter Jan 24 '23 at 18:59

2 Answers


You can use GNU Parallel:

parallel 'cd {} && ./code' ::: folder*

That will keep all your cores busy, starting a new job immediately as each job finishes.
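By default, GNU Parallel runs one job per CPU core, so on a 96-CPU machine this keeps 96 jobs going. If the folders are strictly numbered, a brace expansion avoids relying on the glob's lexical order; this is just a sketch of an equivalent invocation, assuming the folder1 … folder2500 naming from the question:

parallel 'cd folder{} && ./code' ::: {1..2500}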


If you only want to run 48 jobs in parallel, use:

parallel -j 48 ...

If you want to do a dry run and see what would run but without actually running anything, use:

parallel --dry-run ...

If you want to see a progress report, use:

parallel --progress ...
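
These options combine. As a sketch, assuming the same folder layout, the following also records each job in a log file so that an interrupted run can be resumed, skipping jobs that already completed (--joblog and --resume are standard GNU Parallel options; run.log is an arbitrary file name):

# run.log is an arbitrary name for the joblog file
parallel -j 96 --progress --joblog run.log --resume 'cd {} && ./code' ::: folder*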
– Mark Setchell

One bash / wait -n approach (wait -n requires bash 4.3+):

jobmax=96
jobcnt=0

for ((i=1; i<=2500; i++))
do
    ((++jobcnt))
    # if jobcnt > jobmax, wait for one job to finish, then decrement the counter
    [[ "${jobcnt}" -gt "${jobmax}" ]] && wait -n && ((--jobcnt))
    ( cd "folder$i" && ./code ) &      # kick off new job in a subshell
done

wait                                   # wait for the rest of the jobs to complete

NOTES:

  • when the jobs complete quickly (e.g., < 1 sec) it's possible that more than one job could complete during the wait -n; start new job; wait -n cycle, in which case you could end up with fewer than jobmax jobs running at a time (i.e., jobcnt is higher than the actual number of running jobs)
  • however, in this scenario, where each job is expected to take 40-90 minutes to complete, the likelihood of multiple jobs completing during the wait -n; start new job; wait -n cycle should be greatly diminished (if not eliminated); a variant that sidesteps the counter entirely is sketched after these notes
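
If the counter drift matters, a hedged variant (not part of the original answer) is to count the actually running background jobs with the jobs builtin instead of maintaining jobcnt:

jobmax=96

for ((i=1; i<=2500; i++))
do
    # jobs -rp prints one PID per running background job;
    # block while the number of running jobs is at the cap
    while (( $(jobs -rp | wc -l) >= jobmax ))
    do
        wait -n
    done
    ( cd "folder$i" && ./code ) &      # kick off new job in a subshell
done

wait                                   # wait for the remaining jobs to complete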
– markp-fuso