
I am new to LSF (I have been using PBS/Torque until now).

I need to write code/logic to make sure all bsub jobs finish before other commands/jobs can be fired.

Here is what I have done: I have a master shell script that calls several other shell scripts via bsub. I capture the job IDs from bsub in a log file, and I need to ensure that all of those jobs finish before the master script runs its downstream commands.

Master shell script

#!/bin/bash

...Code not shown for brevity..

"Command 1 invoked with multiple bsubs" > log_cmd_1.txt

Need Code logic to use bwait before downstream Commands can be used

"Command 2 will be invoked with multiple bsubs" > log_cmd_2.txt

and so on 

stdout captured from Command 1 within the Master Shell script is stored in log_cmd_1.txt which looks like this

Submitting Sample 101
Job <545> is submitted to .
Submitting Sample 102
Job <546> is submitted to .
Submitting Sample 103
Job <547> is submitted to .
Submitting Sample 104
Job <548> is submitted to .

I have used the codeblock shown below after Command 1 in the master shell script.

However, it does not seem to work for my situation. I suspect I have gotten the whole approach wrong.

while sleep 30m;
do
    #the below gets the JobId from the log_cmd_1.txt and tries bwait

    grep '^Job' <path_to>/log_cmd_1.txt | perl -pe 's!.*?<(\d+)>.*!$1!' | while read -r line; do res=$(bwait -w "done($line)");echo $res; done 1> <path_to>/running.txt;
    # the sed command below deletes empty or whitespace-only lines
    sed '/^\s*$/d' running.txt > running2.txt;
    # -s file check operator means "file is not zero size"
    if [ -s $WORK_DIR/logs/running2.txt ]
        then
            echo "Jobs still running";
        else
            echo "Jobs complete";
            break;
    fi
done

The question: What's the correct way to do this using bwait within the master shell script?

Thanks in advance.

user10101904

1 Answer


bwait blocks until the condition is satisfied, so the loops are probably not necessary. Note that because you're using done, if a job fails, bwait exits and reports that the condition can never be satisfied. Make sure to check for that case.

What you have should work. At least the following test worked for me.

#!/bin/bash

# "Command 1 invoked with multiple bsubs" > log_cmd_1.txt
( bsub sleep 0; bsub sleep 0 ) > log_cmd_1.txt

# Need Code logic to use bwait before downstream Commands can be used
while sleep 1
do
    #the below gets the JobId from the log_cmd_1.txt and tries bwait

    grep '^Job' log_cmd_1.txt | perl -pe 's!.*?<(\d+)>.*!$1!' | while read -r line; do res=$(bwait -w "done($line)");echo "$res"; done 1> running.txt;
    # the sed command below deletes empty or whitespace-only lines
    sed '/^\s*$/d' running.txt > running2.txt;
    # -s file check operator means "file is not zero size"
    if [ -s running2.txt ]
        then
            echo "Jobs still running";
        else
            echo "Jobs complete";
            break;
    fi
done
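
Since bwait blocks on its own, the polling loop can in principle be collapsed into a single call whose condition joins all the job IDs with &&. The following is only a sketch under assumptions: the function names job_ids, wait_condition and wait_for_jobs are made up for illustration, and it assumes bwait returns a non-zero exit status when the condition can never be satisfied, which you should verify on your LSF version.

```shell
#!/bin/bash
# Hypothetical helpers; none of these function names come from LSF.

# Print the job IDs found in bsub output lines such as "Job <545> is submitted to ...".
job_ids() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p' "$1"
}

# Join the IDs into one condition, e.g. "done(545) && done(546)".
# $2 selects the dependency function (done by default, ended as an alternative).
wait_condition() {
    job_ids "$1" | awk -v state="${2:-done}" '
        { cond = cond (NR > 1 ? " && " : "") state "(" $0 ")" }
        END { print cond }'
}

# Block until the condition built from the log file is satisfied.
# Assumption: bwait exits non-zero if the condition can never be met.
wait_for_jobs() {
    cond=$(wait_condition "$1" "${2:-done}")
    [ -n "$cond" ] || { echo "no job IDs found in $1" >&2; return 1; }
    bwait -w "$cond"
}

# Usage in the master script, after Command 1:
#   wait_for_jobs log_cmd_1.txt || { echo "a job failed or bwait errored" >&2; exit 1; }
```

With done, a failed job should make the wait fail early; passing ended as the second argument waits for every job to finish regardless of its exit status, so the script would then need to check the results itself.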

Another way to do it, which may be a little cleaner, is to use job arrays and job dependencies. A job array combines several pieces of work so that they can be managed as a single job. So your

"Command 1 invoked with multiple bsubs" > log_cmd_1.txt

could be submitted as a single job array. You'll need a driver script that can launch the individual jobs. Here's an example driver script.

$ cat runbatch1.sh 
#!/bin/bash

# $LSB_JOBINDEX goes from 1 to 10
if [ "$LSB_JOBINDEX" -eq 1 ]; then

  # do the work for job batch 1, job 1

  ...

elif [ "$LSB_JOBINDEX" -eq 2 ]; then

  # etc
  ...

fi

Then you can submit the job array like this.

bsub -J 'batch1[1-10]' sh runbatch1.sh

This command runs 10 job array elements. In the driver script's environment, the variable LSB_JOBINDEX tells you which element the driver is running. Since the array has a name, batch1, it's easier to manage. You can submit a second job array that won't start until all elements of the first have completed successfully. The second array is submitted with this command.
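
Instead of one if/elif branch per element, the driver can look up its work item by index. This is a minimal sketch, assuming a hypothetical samples.txt with one sample ID per line and a hypothetical process_sample.sh that does the real work; SAMPLE_LIST and the function names are also invented for illustration.

```shell
#!/bin/bash
# Print line number $2 of file $1, i.e. the sample assigned to that array element.
sample_for_index() {
    sed -n "${2}p" "$1"
}

# Run the work for one array element.
# samples.txt and process_sample.sh are hypothetical names.
run_element() {
    sample=$(sample_for_index "${SAMPLE_LIST:-samples.txt}" "$LSB_JOBINDEX")
    [ -n "$sample" ] || { echo "no sample for index $LSB_JOBINDEX" >&2; return 1; }
    ./process_sample.sh "$sample"
}

# Only do the work when LSF has set a positive array index.
if [ -n "${LSB_JOBINDEX:-}" ] && [ "$LSB_JOBINDEX" -gt 0 ]; then
    run_element
fi
```

Because the script's exit status is that of the last command run, a failing element exits non-zero, which is what a done(...) dependency on the array relies on.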

bsub -w 'done(batch1)' -J 'batch2[1-10]' sh runbatch2.sh
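
If there are more than two stages, the same pattern can be wrapped in a small loop. The sketch below assumes each stage is a 10-element array driven by a script named run<stage>.sh, as in the commands above; submit_chain and the BSUB override are hypothetical additions so the chain can be previewed without submitting anything.

```shell
#!/bin/bash
# Submit each named stage as a job array that depends on the previous stage.
# BSUB is a hypothetical override (not an LSF variable) for dry runs.
submit_chain() {
    prev=""
    for stage in "$@"; do
        if [ -z "$prev" ]; then
            # First stage: no dependency.
            "${BSUB:-bsub}" -J "${stage}[1-10]" sh "run${stage}.sh"
        else
            # Later stages start only after the previous array is done.
            "${BSUB:-bsub}" -w "done(${prev})" -J "${stage}[1-10]" sh "run${stage}.sh"
        fi || return 1
        prev=$stage
    done
}

# Usage in the master script:
#   submit_chain batch1 batch2
# Preview the bsub arguments without submitting:
#   BSUB=echo submit_chain batch1 batch2
```

With this approach the master script does not need to wait at all, since LSF holds each stage until its dependency is met.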

I hope that this helps.

Michael Closson
  • Hi @Michael, your answer is very useful. I haven't yet tried your second approach. I thought the first approach (the one I posted in my question) should work, but something keeps breaking and it drops out of the loop, probably because of the use of `done`, as you explain. I am now experimenting with a third approach: I capture all the job IDs using the `grep | perl` pipeline I posted, convert the multi-line string to a single line, and join them with &&, as in `bwait -w 'ended(jobid1)' && 'ended(jobid2)' && ...`, following this answer: https://stackoverflow.com/a/18746666/10101904 – user10101904 Sep 17 '19 at 03:35
  • @Michael, what's the correct way to use the `-r` flag with bwait (to check every 5 minutes)? Is it `bwait -r 5 -w 'ended(1)' && -r 5 'ended(2)' && ...` or `bwait -r 5 -w 'ended(1)' && 'ended(2)' && ...`? – user10101904 Sep 17 '19 at 14:50
  • Just 1 `-r` per call to bwait. – Michael Closson Sep 17 '19 at 15:34
  • A typo `LSB_JOB_INDEX` in your explanation rather than `LSB_JOBINDEX`? – A rural reader Feb 24 '22 at 23:01