
I'm running a shell script, script.sh, in parallel; each of its lines goes into a folder and runs a Fortran code:

cd folder1 && ./code & 
cd folder2 && ./code &
cd folder3 && ./code &
cd folder4 && ./code &
...
cd folder96 && ./code &
wait
cd folder97 && ./code &
...
cd folder2500 && ./code

There are around 2500 folders, and the code outputs are independent of each other. I have access to 96 CPUs, and each job uses around 1% of the total CPU, so I run 96 jobs in parallel using & and the wait command. For various reasons, not all 96 jobs finish at the same time: some take 40 minutes, some 90 minutes, an important difference. So I was wondering whether the jobs that finish earlier could free their CPUs for new jobs, in order to optimize the total execution time.

I also tried GNU Parallel:

parallel -a script.sh

but it had the same issue, and I could not find anyone on the internet with a similar problem.

  • You have not explained what "the issue" is. Why can't you start all of the jobs and let the OS do its job of scheduling them? – Scott Hunter Jan 24 '23 at 18:59

2 Answers


You can use GNU Parallel:

parallel 'cd {} && ./code' ::: folder*

That will keep all your cores busy, starting a new job immediately as each job finishes.
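By default, GNU Parallel runs one job per CPU core, so on a 96-CPU machine this keeps 96 jobs going. If the folders are strictly numbered, a brace expansion avoids relying on the glob's lexical order; this is just a sketch of an equivalent invocation, assuming the folder1 … folder2500 naming from the question:

parallel 'cd folder{} && ./code' ::: {1..2500}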


If you only want to run 48 jobs in parallel, use:

parallel -j 48 ...

If you want to do a dry run and see what would run but without actually running anything, use:

parallel --dry-run ...

If you want to see a progress report, use:

parallel --progress ...
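
These options combine. As a sketch, assuming the same folder layout, the following also records each job in a log file so that an interrupted run can be resumed, skipping jobs that already completed (--joblog and --resume are standard GNU Parallel options; run.log is an arbitrary file name):

# run.log is an arbitrary name for the joblog file
parallel -j 96 --progress --joblog run.log --resume 'cd {} && ./code' ::: folder*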
– Mark Setchell

One bash / wait -n approach (wait -n requires bash 4.3+):

jobmax=96
jobcnt=0

for ((i=1; i<=2500; i++))
do
    ((++jobcnt))
    # if jobcnt > jobmax, wait for one job to finish, then decrement the counter
    [[ "${jobcnt}" -gt "${jobmax}" ]] && wait -n && ((--jobcnt))
    ( cd "folder$i" && ./code ) &      # kick off new job in a subshell
done

wait                                   # wait for the rest of the jobs to complete

NOTES:

  • when the jobs complete quickly (e.g., < 1 sec) it's possible that more than one job could complete during the wait -n; start new job; wait -n cycle, in which case you could end up with fewer than jobmax jobs running at a time (i.e., jobcnt is higher than the actual number of running jobs)
  • however, in this scenario, where each job is expected to take 40-90 minutes to complete, the likelihood of multiple jobs completing during the wait -n; start new job; wait -n cycle should be greatly diminished (if not eliminated); a variant that sidesteps the counter entirely is sketched after these notes
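
If the counter drift matters, a hedged variant (not part of the original answer) is to count the actually running background jobs with the jobs builtin instead of maintaining jobcnt:

jobmax=96

for ((i=1; i<=2500; i++))
do
    # jobs -rp prints one PID per running background job;
    # block while the number of running jobs is at the cap
    while (( $(jobs -rp | wc -l) >= jobmax ))
    do
        wait -n
    done
    ( cd "folder$i" && ./code ) &      # kick off new job in a subshell
done

wait                                   # wait for the remaining jobs to complete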
– markp-fuso