
Is there any way / binary that provides a semaphore-like structure in bash? E.g. for running a fixed number of (background) sub-processes as we loop through a directory of files. (I'm saying "sub-process" here and not "thread", since I'm using an appended & in my bash commands to do the "multithreading", but I would be open to any more convenient suggestions.)

My actual use case is trying to use a binary called bcp on CentOS 7 to write a (variable-sized) set of TSV files to a remote MSSQL Server DB, and I have observed that there seems to be a problem with the program when running too many sub-processes at once. E.g. something like

for filename in $DATAFILES/$TARGET_GLOB; do

    if [ ! -f $filename ]; then
        echo -e "\nFile $filename not found!\nExiting..."
        exit 255
    else
        echo -e "\nImporting $filename data to $DB/$TABLE"
    fi

    echo -e "\nStarting BCP export threads for $filename"
    /opt/mssql-tools/bin/bcp "$TABLE" in "$filename" \
        $TO_SERVER_ODBCDSN \
        -U $USER -P $PASSWORD \
        -d $DB \
        $RECOMMEDED_IMPORT_MODE \
        -t "\t" \
        -e ${filename}.bcperror.log &
done
# collect all subprocesses at the end
wait

which starts a new sub-process for every file all at once in an unrestricted way, and appears to crash each sub-process. I would like to see if adding a semaphore-like structure into the loop, to cap the number of sub-processes that get spun up, would help. E.g. something like (using some non-bash-like pseudo-code here)

sem = Semaphore(locks=5)
for filename in $DATAFILES/$TARGET_GLOB; do

    if [ ! -f $filename ]; then
        echo -e "\nFile $filename not found!\nExiting..."
        exit 255
    else
        echo -e "\nImporting $filename data to $DB/$TABLE"
    fi

    sem.lock()
    <same code from original loop>
    sem.unlock()

done
# collect all subprocesses at the end
wait

If anything like this is possible, or if this is a common problem with an existing best-practice solution (I'm pretty new to bash programming), advice would be appreciated.
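For reference, one common plain-bash approximation of this idea (a rough, untested sketch; MAX_JOBS is just an illustrative name) is to poll the shell's job table and block while the limit is reached:

MAX_JOBS=5
for filename in $DATAFILES/$TARGET_GLOB; do
    # block here until fewer than MAX_JOBS background jobs are still running
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
        sleep 1
    done
    # ... same per-file check and backgrounded bcp command as in the loop above ...
done
# collect all remaining subprocesses at the end
wait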

lampShadesDrifter
  • I've done something similar, but using one or more dummy files that I rename (name.0, name.1, name.2, ... ) to trigger progress through a series of steps between processes or even computers on a network. – rcgldr Oct 04 '18 at 02:30
  • The [`lockfile`](https://linux.die.net/man/1/lockfile) command? – bishop Oct 04 '18 at 02:31
  • I did something similar many years ago. I wrote a number of command-line wrappers around the C semaphore APIs, but I have also used files for locking in other projects. The problem with all of these is what happens when things go wrong, when a program fails and retains the lock. Due consideration has to be given to tidying-up in your design. – cdarke Oct 04 '18 at 05:20
  • Just use **GNU Parallel** and it is simple to control how many things run in parallel... https://stackoverflow.com/a/33899532/2836621 – Mark Setchell Oct 04 '18 at 07:56
  • If you want to use your semaphore-based idea, be aware that **GNU Parallel** can do that for you too. On installation, it creates a symbolic link to itself called `sem` and you can use that as a mutex or counting semaphore (with `-j5` in your case). See https://stackoverflow.com/a/38738644/2836621 – Mark Setchell Oct 04 '18 at 08:08

3 Answers


This isn't strictly equivalent, but you can use xargs to start up to a given number of processes at once:

-P max-procs, --max-procs=max-procs
      Run  up  to max-procs processes at a time; the default is 1.  If
      max-procs is 0, xargs will run as many processes as possible  at
      a  time.   Use the -n option or the -L option with -P; otherwise
      chances are that only one exec will be  done.   While  xargs  is
      running,  you  can send its process a SIGUSR1 signal to increase
      the number of commands to run simultaneously, or  a  SIGUSR2  to
      decrease  the  number.   You  cannot decrease it below 1.  xargs
      never terminates its commands; when asked to decrease, it merely
      waits  for  more  than  one existing command to terminate before
      starting another.

Something like:

printf "%s\n" $DATAFILES/$TARGET_GLOB |
  xargs -d '\n' -I {} --max-procs=5 bash -c '
    filename=$1
    if [ ! -f $filename ]; then
        echo -e "\nFile $filename not found!\nExiting..."
        exit 255
    else
        echo -e "\nImporting $filename data to $DB/$TABLE"
    fi

    echo -e "\nStarting BCP export threads for $filename"
    /opt/mssql-tools/bin/bcp "$TABLE" in "$filename" \
        $TO_SERVER_ODBCDSN \
        -U $USER -P $PASSWORD \
        -d $DB \
        $RECOMMEDED_IMPORT_MODE \
        -t "\t" \
        -e ${filename}.bcperror.log
  ' _ {}

You'll need to export the TABLE, TO_SERVER_ODBCDSN, USER, PASSWORD, DB and RECOMMEDED_IMPORT_MODE variables beforehand so that they're available in the processes started by xargs. Alternatively, you can put the commands run here with bash -c into a separate script and set the variables in that script.
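For example, assuming those variables are already set in the calling shell, a single export line before the pipeline is enough:

export TABLE TO_SERVER_ODBCDSN USER PASSWORD DB RECOMMEDED_IMPORT_MODE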

muru

Following the recommendation by @Mark Setchell, using GNU Parallel to replace the loop (in a simulated cron environment; see https://stackoverflow.com/a/2546509/8236733) with

bcpexport() {
    filename=$1
    TO_SERVER_ODBCDSN=$2
    DB=$3 
    TABLE=$4 
    USER=$5
    PASSWORD=$6
    RECOMMEDED_IMPORT_MODE=$7
    DELIMITER=$8 # DO NOT use format like "'\t'", nested quotes seem to cause hard-to-catch error
    <same code from original loop>
}
export -f bcpexport
parallel -j 10 bcpexport \
    ::: $DATAFILES/$TARGET_GLOB \
    ::: "$TO_SERVER_ODBCDSN" \
    ::: $DB \
    ::: $TABLE \
    ::: $USER \
    ::: $PASSWORD \
    ::: $RECOMMEDED_IMPORT_MODE \
    ::: $DELIMITER

to run at most 10 jobs at a time, where $DATAFILES/$TARGET_GLOB is a glob string returning all of the files in the desired dir (e.g. "$storagedir/tsv/*.tsv") that we want to go through, with the remaining fixed args paired against each of the elements returned by that glob as the remaining parallel inputs shown. (The $TO_SERVER_ODBCDSN variable is actually "-D -S <some ODBC DSN>", so it needed quotes in order to be passed as a single arg.) So if the $DATAFILES/$TARGET_GLOB glob returns files A, B, C, ..., we end up running the commands

bcpexport A "$TO_SERVER_ODBCDSN" $DB ...
bcpexport B "$TO_SERVER_ODBCDSN" $DB ...
bcpexport C "$TO_SERVER_ODBCDSN" $DB ...
...

in parallel. An additional nice thing about using parallel is

GNU parallel makes sure output from the commands is the same output as you would get had you run the commands sequentially.

lampShadesDrifter
  • While it is not wrong to use `:::` when there is a single fixed arg, it is unconventional. Normally you would just put these in the command template and use `{}` for the varying arg. It will typically also improve readability: `parallel -j 10 bcpexport fixed_arg1 {} fixed_arg2 fixed_arg3 ::: varying arg values` – Ole Tange Oct 05 '18 at 08:58
  • @OleTange In my case, one of the fixed args is a string that contains spaces (eg. `TO_SERVER_ODBCDSN="-D -S MyODBCDSN"`), and I found that when trying to use the `parallel {} ::: ` syntax, the string was being split and used as separate args (despite surrounding it with quotes or `${}`). IDK why that changed things, but in any case I had to stick with the hacky syntax. – lampShadesDrifter Oct 05 '18 at 21:38
  • You would normally use `-q` if you do not want them expanded. – Ole Tange Oct 06 '18 at 00:23

Using &

Example code

#!/bin/bash
xmms2 play &     # launched in the background; the script continues immediately
sleep 5
xmms2 next &     # also backgrounded
sleep 1
xmms2 stop       # runs in the foreground, so the script waits for it