
Is there any way / binary that provides a semaphore-like structure in bash? E.g. for running a fixed number of (background) sub-processes as we loop through a directory of files. (I'm saying "sub-process" here and not "thread", since I'm using an appended & in my bash commands to do the "multithreading", but I would be open to any more convenient suggestions.)

My actual use case is trying to use a binary called bcp on CentOS 7 to write a (variable-sized) set of TSV files to a remote MSSQL Server DB, and I have observed that there seems to be a problem with the program when running too many sub-processes at once. E.g. something like

for filename in $DATAFILES/$TARGET_GLOB; do

    if [ ! -f $filename ]; then
        echo -e "\nFile $filename not found!\nExiting..."
        exit 255
    else
        echo -e "\nImporting $filename data to $DB/$TABLE"
    fi

    echo -e "\nStarting BCP export threads for $filename"
    /opt/mssql-tools/bin/bcp "$TABLE" in "$filename" \
        $TO_SERVER_ODBCDSN \
        -U $USER -P $PASSWORD \
        -d $DB \
        $RECOMMEDED_IMPORT_MODE \
        -t "\t" \
        -e ${filename}.bcperror.log &
done
# collect all subprocesses at the end
wait

which starts a new sub-process for every file all at once in an unrestricted way, and appears to crash each sub-process. I would like to see if adding a semaphore-like structure into the loop, to cap the number of sub-processes that get spun up, would help. E.g. something like (using some non-bash-like pseudo-code here)

sem = Semaphore(locks=5)
for filename in $DATAFILES/$TARGET_GLOB; do

    if [ ! -f $filename ]; then
        echo -e "\nFile $filename not found!\nExiting..."
        exit 255
    else
        echo -e "\nImporting $filename data to $DB/$TABLE"
    fi

    sem.lock()
    <same code from original loop>
    sem.unlock()

done
# collect all subprocesses at the end
wait

If anything like this is possible, or if this is a common problem with an existing best-practice solution (I'm pretty new to bash programming), advice would be appreciated.
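For reference, one common plain-bash approximation of this idea (a rough, untested sketch; MAX_JOBS is just an illustrative name) is to poll the shell's job table and block while the limit is reached:

MAX_JOBS=5
for filename in $DATAFILES/$TARGET_GLOB; do
    # block here until fewer than MAX_JOBS background jobs are still running
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
        sleep 1
    done
    # ... same per-file check and backgrounded bcp command as in the loop above ...
done
# collect all remaining subprocesses at the end
wait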

lampShadesDrifter
  • I've done something similar, but using one or more dummy files that I rename (name.0, name.1, name.2, ... ) to trigger progress through a series of steps between processes or even computers on a network. – rcgldr Oct 04 '18 at 02:30
  • The [`lockfile`](https://linux.die.net/man/1/lockfile) command? – bishop Oct 04 '18 at 02:31
  • I did something similar many years ago. I wrote a number of command-line wrappers around the C semaphore APIs, but I have also used files for locking in other projects. The problem with all of these is what happens when things go wrong, when a program fails and retains the lock. Due consideration has to be given to tidying-up in your design. – cdarke Oct 04 '18 at 05:20
  • Just use **GNU Parallel** and it is simple to control how many things run in parallel... https://stackoverflow.com/a/33899532/2836621 – Mark Setchell Oct 04 '18 at 07:56
  • If you want to use your semaphore-based idea, be aware that **GNU Parallel** can do that for you too. On installation, it creates a symbolic link to itself called `sem` and you can use that as a mutex or counting semaphore (with `-j5` in your case). See https://stackoverflow.com/a/38738644/2836621 – Mark Setchell Oct 04 '18 at 08:08

3 Answers


This isn't strictly equivalent, but you can use xargs to start up to a given number of processes at once:

-P max-procs, --max-procs=max-procs
      Run  up  to max-procs processes at a time; the default is 1.  If
      max-procs is 0, xargs will run as many processes as possible  at
      a  time.   Use the -n option or the -L option with -P; otherwise
      chances are that only one exec will be  done.   While  xargs  is
      running,  you  can send its process a SIGUSR1 signal to increase
      the number of commands to run simultaneously, or  a  SIGUSR2  to
      decrease  the  number.   You  cannot decrease it below 1.  xargs
      never terminates its commands; when asked to decrease, it merely
      waits  for  more  than  one existing command to terminate before
      starting another.

Something like:

printf "%s\n" $DATAFILES/$TARGET_GLOB |
  xargs -d '\n' -I {} --max-procs=5 bash -c '
    filename=$1
    if [ ! -f $filename ]; then
        echo -e "\nFile $filename not found!\nExiting..."
        exit 255
    else
        echo -e "\nImporting $filename data to $DB/$TABLE"
    fi

    echo -e "\nStarting BCP export threads for $filename"
    /opt/mssql-tools/bin/bcp "$TABLE" in "$filename" \
        $TO_SERVER_ODBCDSN \
        -U $USER -P $PASSWORD \
        -d $DB \
        $RECOMMEDED_IMPORT_MODE \
        -t "\t" \
        -e ${filename}.bcperror.log
  ' _ {}

You'll need to export the TABLE, TO_SERVER_ODBCDSN, USER, PASSWORD, DB and RECOMMEDED_IMPORT_MODE variables beforehand so that they're available in the processes started by xargs. Alternatively, you can put the commands run here with bash -c into a separate script and set the variables in that script.
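For example, assuming those variables are already set in the calling shell, a single export line before the pipeline is enough:

export TABLE TO_SERVER_ODBCDSN USER PASSWORD DB RECOMMEDED_IMPORT_MODE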

muru

Following the recommendation by @Mark Setchell, using GNU Parallel to replace the loop (in a simulated cron environment; see https://stackoverflow.com/a/2546509/8236733) with

bcpexport() {
    filename=$1
    TO_SERVER_ODBCDSN=$2
    DB=$3 
    TABLE=$4 
    USER=$5
    PASSWORD=$6
    RECOMMEDED_IMPORT_MODE=$7
    DELIMITER=$8 # DO NOT use format like "'\t'", nested quotes seem to cause hard-to-catch error
    <same code from original loop>
}
export -f bcpexport
parallel -j 10 bcpexport \
    ::: $DATAFILES/$TARGET_GLOB \
    ::: "$TO_SERVER_ODBCDSN" \
    ::: $DB \
    ::: $TABLE \
    ::: $USER \
    ::: $PASSWORD \
    ::: $RECOMMEDED_IMPORT_MODE \
    ::: $DELIMITER

to run at most 10 jobs at a time, where $DATAFILES/$TARGET_GLOB is a glob string returning all of the files in the desired dir (e.g. "$storagedir/tsv/*.tsv") that we want to go through, with the remaining fixed args paired against each of the elements returned by that glob as the remaining parallel inputs shown. (The $TO_SERVER_ODBCDSN variable is actually "-D -S <some ODBC DSN>", so it needed quotes in order to be passed as a single arg.) So if the $DATAFILES/$TARGET_GLOB glob returns files A, B, C, ..., we end up running the commands

bcpexport A "$TO_SERVER_ODBCDSN" $DB ...
bcpexport B "$TO_SERVER_ODBCDSN" $DB ...
bcpexport C "$TO_SERVER_ODBCDSN" $DB ...
...

in parallel. An additional nice thing about using parallel is

GNU parallel makes sure output from the commands is the same output as you would get had you run the commands sequentially.

lampShadesDrifter
  • While it is not wrong to use `:::` when there is a single fixed arg, it is unconventional. Normally you would just put these in the command template and use `{}` for the varying arg. It will typically also improve readability: `parallel -j 10 bcpexport fixed_arg1 {} fixed_arg2 fixed_arg3 ::: varying arg values` – Ole Tange Oct 05 '18 at 08:58
  • @OleTange In my case, one of the fixed args is a string that contains spaces (eg. `TO_SERVER_ODBCDSN="-D -S MyODBCDSN"`), and I found that when trying to use the `parallel {} ::: ` syntax, the string was being split and used as separate args (despite surrounding it with quotes or `${}`). IDK why that changed things, but in any case I had to stick with the hacky syntax. – lampShadesDrifter Oct 05 '18 at 21:38
  • You would normally use `-q` if you do not want them expanded. – Ole Tange Oct 06 '18 at 00:23

Using &

Example code

#!/bin/bash
xmms2 play &     # launched in the background; the script continues immediately
sleep 5
xmms2 next &     # also backgrounded
sleep 1
xmms2 stop       # runs in the foreground, so the script waits for it