
In my shell script, if several nested functions are running concurrently and one of them fails, the script does not fail.

status() {
    exit_st=$1
    error_1=$2
    if ! [[ $exit_st -eq 0 ]]; then
        echo "[ERROR] -  ${error_1}"
        exit 1
    else 
        echo "[INFO] -  ${error_1}"
    fi
}

abc(){
   val1=$1
   val2=$2
   #Some SQL command here
   status $? "SQL command step"
}

abc cmd1 cmd2 &

abc cmd3 cmd4 &

wait 

echo 'hi'

In the above code, if the command at #Some SQL command here fails, the script does not exit and proceeds to print hi. I tried changing the exit to return, but it still does not error out. I want the entire script to exit with a non-zero code if any abc job fails.

My Bash version is GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu), so I am not able to use the wait -n option.

dijeah
  • `wait` instructs the shell to wait for all jobs. If one of them fails, it will still wait for the others. What do you want to do with the jobs that are still running? Do you want to send them a signal so they terminate? – William Pursell Apr 25 '23 at 18:45
  • Running the `abc` function in the background (with `&`) runs it in a subshell; running `exit` in a subshell exits *only that subshell*, not the parent shell (or any other subshells). – Gordon Davisson Apr 25 '23 at 23:42
  • @WilliamPursell I want other jobs to fail as well and exit the script with a non-zero exit code. Right now it proceeds with printing `hi` in the output and exits successfully – dijeah Apr 26 '23 at 11:36
  • 1
    See [How to wait in bash for several subprocesses to finish and return exit code !=0 when any subprocess ends with code !=0?](https://stackoverflow.com/q/356100/4154375). – pjh Apr 26 '23 at 12:44
  • @GordonDavisson What do you mean when you say you "want other jobs to fail"? What is the mechanism by which they should fail? Do they have any mechanism for communicating, or do you want the parent shell to send them a signal (or communicate with them in some other way?) – William Pursell Apr 26 '23 at 20:16

1 Answer

Bash does not exit automatically when a background process exits with non-zero status, even if set -e (set -o errexit) is active. If you want your program to exit when a background process fails, then you'll need to explicitly detect the failure.
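As a quick illustration (a minimal sketch, not taken from the code in the question), even with errexit enabled a failing background job does not stop the script:

#! /bin/bash
set -e

# 'false' fails immediately, but it runs in a background subshell, so its
# non-zero status does not trigger errexit in the parent shell.
false &

wait                     # waits for the background job; returns 0 here
echo 'still running'     # this line is reached despite the failure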

If you've got Bash 4.3 (released in 2014) or later, then you can do it in the code in the question by replacing

wait

with

while wait -n; do
    :
done
  • The -n option to wait was introduced in Bash 4.3. It causes wait to wait for the next background process to exit, and returns its status.
  • The loop runs until wait -n returns non-zero status, either because a background process exited with non-zero status or because all background processes have already exited with zero status (in which case wait -n has nothing left to wait for and returns non-zero).
  • See [ProcessManagement - Greg's Wiki](https://mywiki.wooledge.org/ProcessManagement) for more information about wait -n, and general information about handling background processes in Bash. That page says to run set -m in programs that use wait -n. I haven't found that to be necessary, but that may be because I am using a much later version of Bash. YMMV. A usage sketch applying wait -n to the code in the question follows.
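For example (a sketch only, assuming Bash 4.3 or later, that the status and abc functions from the question are defined above, and that the abc jobs are the only background jobs the script starts), the failure can be detected and turned into a non-zero exit by calling wait -n once per job that was started:

abc cmd1 cmd2 &
abc cmd3 cmd4 &

# Two jobs were started, so call 'wait -n' twice; each call returns the
# exit status of the next job to finish.
for _ in 1 2; do
    if ! wait -n; then
        echo '[ERROR] - a background job failed' >&2
        exit 1
    fi
done

echo 'hi'

Note that exiting as soon as one job fails may leave the other job running, which is the issue discussed next.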

Although the while wait ... loop allows the program to continue as soon as a background process fails (e.g. so it can call exit), it may leave other background processes still running, even after the main program exits. That could lead to unwanted processing and/or unexpected output to the terminal. You might want to kill any remaining background processes after the while wait ... loop terminates. One way to do that is:

jobs_output=$(jobs)
while IFS= read -r line; do
    # Skip lines that don't contain a job number (e.g. empty output when
    # no background jobs remain)
    [[ $line == *\[*\]* ]] || continue
    jobnum=${line#*\[}
    jobnum=${jobnum%%\]*}
    kill "%$jobnum"
done <<<"$jobs_output"

With versions of Bash older than 4.3 my preferred option for managing background processes is to use the jobs command in a polling loop.

This is a Shellcheck-clean, modified version of your program that demonstrates the technique:

#! /bin/bash -p

function status
{
    local -r exit_st=$1
    local -r error_1=$2

    if (( exit_st == 0 )); then
        printf '[INFO] - %s\n' "$error_1" >&2
    else 
        printf '[ERROR] - %s\n' "$error_1" >&2
        exit 1
    fi
}

function abc
{
   local -r val1=$1
   local -r val2=$2

   run_sql_command "$val1" "$val2"
   status "$?" 'SQL command step'
}

# Wait for background processes (specified by PIDs given as function
# arguments) to complete.
# If any background process completes with non-zero exit status, return
# immediately (without waiting for any other background processes) using the
# failed process's exit status as the return status.
function wait_for_pids
{
    local -r bgpids=( "$@" )

    # Use a sparse array ('is_active_pid') indexed by PID values to maintain
    # a set of background processes that are still active
    local pid is_active_pid=()
    for pid in "${bgpids[@]}"; do
        is_active_pid[pid]=1
    done

    local jobs_output active_pids=() old_active_pids=()
    while (( ${#is_active_pid[*]} > 0 )); do
        # Get a list of PIDs of background processes that are still active
        jobs_output=$(jobs -pr)
        IFS=$'\n' read -r -d '' -a active_pids <<<"$jobs_output"

        old_active_pids=( "${!is_active_pid[@]}" )

        # Update the set of still active background PIDs
        is_active_pid=()
        for pid in ${active_pids[@]+"${active_pids[@]}"}; do
            is_active_pid[pid]=1
        done

        # Find processes that are no longer active (i.e. they have exited)
        # and check their exit statuses
        for pid in "${old_active_pids[@]}"; do
            if (( ! ${is_active_pid[pid]-0} )); then
                wait "$pid" || return "$?"
            fi
        done

        sleep 1
    done
}

# Kill all background processes that are running, and exit the program
# with the exit status provided as an argument
function kill_running_jobs_and_exit
{
    local -r exit_status=$1

    local jobs_output line jobnum
    jobs_output=$(jobs -r)
    while IFS= read -r line; do
        [[ $line == *\[*\]* ]] || continue
        jobnum=${line#*\[}
        jobnum=${jobnum%%\]*}
        # Kill by job number instead of PID because killing by PID is
        # subject to race conditions that may cause the wrong process to be
        # killed
        kill "%$jobnum"
        printf '[INFO] - Killed: %s\n' "$line" >&2
    done <<<"$jobs_output"

    exit "$exit_status"
}

bgpids=()

abc cmd1 cmd2 &
bgpids+=( "$!" )

abc cmd3 cmd4 &
bgpids+=( "$!" )

wait_for_pids "${bgpids[@]}" || kill_running_jobs_and_exit "$?"

echo 'hi'
  • Several of the changes are minor ones to fix Shellcheck warnings or to convert to standard or best practices (e.g. sending diagnostic output to standard error and using printf instead of echo).
  • One significant change is that an array, bgpids, is used to keep a list of PIDs of background processes.
  • Another significant change is the addition of two new functions: wait_for_pids and kill_running_jobs_and_exit.
  • The final significant change is replacing wait with wait_for_pids "${bgpids[@]}" || kill_running_jobs_and_exit "$?".
  • Replace run_sql_command "$val1" "$val2" with whatever is appropriate for you. I wrote and used a function called run_sql_command for testing; a hypothetical stand-in is sketched after this list.
  • I've used a polling interval of one second (sleep 1). Something different might be better for you (e.g. sleep 10 or (if your sleep supports floating point arguments) sleep 0.1).
  • See the Sparse Arrays section of BashGuide/Arrays - Greg's Wiki for information about how the is_active_pid array is used.
  • ${active_pids[@]+"${active_pids[@]}"} is used instead of "${active_pids[@]}" to work around a bug in older versions of Bash that caused it to mishandle empty arrays when set -o nounset (set -u) is in effect. See bash empty array expansion with 'set -u'.
  • I tested the code with Bash version 3.2. It should work with all later versions of Bash too.
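For testing purposes, a hypothetical stand-in for run_sql_command (not part of the original program) might look like the sketch below; the cmd3/cmd4 job is made to fail so that the error-handling path can be observed:

# Hypothetical test stub; replace with a real SQL invocation.
function run_sql_command
{
    local -r val1=$1
    local -r val2=$2

    sleep 2                 # simulate some work
    [[ $val1 != cmd3 ]]     # force a non-zero status for the cmd3/cmd4 job
}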
pjh
  • I can see that the `kill` command terminates the job but proceeds with the rest of the script. I do not want it to proceed with the rest of the script. Rather, I want the script to exit with a non-zero code. The behavior is the same with the `while wait` loop – dijeah Apr 26 '23 at 11:39
  • I found my bash version is not 4.3: `GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)` – dijeah Apr 26 '23 at 12:07
  • Bash 4.2 doesn't have `wait -n`, so the code in my answer won't work for you. Sorry. You'll find alternatives on the [ProcessManagement - Greg's Wiki](https://mywiki.wooledge.org/ProcessManagement) page. You might also find useful techniques in [Parallelize Bash script with maximum number of processes](https://stackoverflow.com/q/38160/4154375). – pjh Apr 26 '23 at 12:34