
I have a directory containing several sub-directories with the names

1
2
3
4
backup_1
backup_2

I wrote a parallelized bash script to process files in these folders, and a minimal working example is as follows:

#!/bin/bash
P=`pwd`
task(){
    dirname=$(basename $dir)
    echo $dirname running >> output.out
    if [[ $dirname != "backup"* ]]; then
        sed -i "s/$dirname running/$dirname is good/" $P/output.out
    else
        sed -i "s/$dirname running/$dirname ignored/" $P/output.out
    fi
}

for dir in */; do
    ((i=i%8)); ((i++==0)) && wait
    task "$dir" &
done
wait
echo all done

The "wait" at the end of the script is supposed to wait for all processes to finish before proceeding to echo "all done". After all processes have finished, output.out should show

1 is good
2 is good
3 is good
4 is good
backup_1 ignored
backup_2 ignored

I am able to get this output if I set the script to run in serial with ((i=i%1)); ((i++==0)) && wait. However, if I run it in parallel with ((i=i%2)); ((i++==0)) && wait, I get something like

2 is good
1 running
3 running
4 is good
backup_1 running
backup_2 ignored

Can anyone tell me why wait is not working in this case?

I also know that GNU parallel can parallelize tasks in the same way. However, I don't know how to tell parallel to run this task on all sub-directories of the parent directory. It would be great if someone could provide a sample script that I can follow.

Many thanks, Jacek

Jacek
  • `sed -i` replaces the file, so some of your processes are writing to different files (that are deleted when overwritten). – Shawn Nov 19 '21 at 07:08
  • You can avoid race conditions by doing something like having each one write to its own unique log, and then merging them all together at the end for convenience. – Shawn Nov 19 '21 at 07:39
  • (0) Why would you expect a particular ordering of the output when running stuff in parallel? Give each process a separate output file, assemble outputs after they finish. (1) Always use `local` variables in functions. (Sure, a `&`-executed process cannot influence its parent, but it’s a good practice and a necessity in the single-process case.) – Andrej Podzimek Nov 19 '21 at 11:15
  • (2) The solution is flawed, because it waits for *all* 8 processes to finish before starting new ones. If one process takes 10-times longer than the rest, this is inefficient. You can instead [always run 8 parallel processes](https://stackoverflow.com/questions/69451800/how-to-run-multiple-tasks-in-the-same-time-in-a-loop/69452918#69452918). – Andrej Podzimek Nov 19 '21 at 11:15
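The comment suggestions (per-process log files merged at the end, and replacing the batch-wise `wait` with a throttle that refills slots as they free up) can be sketched as follows. The log file names are illustrative, and `wait -n` requires bash ≥ 4.3:

```shell
#!/bin/bash
# Each task writes to its own log file, so no two processes ever
# touch the same file and there is nothing for sed to race on.
task() {
    local dir dirname
    dir="$1"
    dirname=$(basename "$dir")
    if [[ $dirname != backup* ]]; then
        echo "$dirname is good" > "log_$dirname.tmp"
    else
        echo "$dirname ignored" > "log_$dirname.tmp"
    fi
}

i=0
for dir in */; do
    # After 8 jobs are running, wait for any one of them to finish
    # before starting the next (bash >= 4.3), instead of waiting for
    # the whole batch of 8.
    ((i++ >= 8)) && wait -n
    task "$dir" &
done
wait
# Merge the per-task logs into the final output and clean up.
cat log_*.tmp > output.out && rm -f log_*.tmp
echo all done
```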

1 Answer


A literal port to GNU Parallel looks like this:

task(){
    dir="$1"
    P=`pwd`
    dirname=$(basename $dir)
    echo $dirname running >> output.out
    if [[ $dirname != "backup"* ]]; then
        sed -i "s/$dirname running/$dirname is good/" $P/output.out
    else
        sed -i "s/$dirname running/$dirname ignored/" $P/output.out
    fi
}
export -f task

parallel -j8 task ::: */
echo all done

As others have pointed out, you have race conditions when you run sed on the same file in parallel.
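A small demo of the mechanism behind the race (file name is illustrative; assumes GNU sed): `sed -i` writes a temporary file and renames it over the original, so the inode changes, and any process still appending to the old inode is writing to a deleted file.

```shell
# Show that sed -i replaces the file rather than editing it in place.
echo "1 running" > demo.out
before=$(ls -i demo.out | awk '{print $1}')   # inode before sed -i
sed -i 's/1 running/1 is good/' demo.out
after=$(ls -i demo.out | awk '{print $1}')    # inode after sed -i
echo "inode before: $before, after: $after"   # the two numbers differ
```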

To avoid race conditions you could do:

task(){
    dir="$1"
    P=`pwd`
    dirname=$(basename $dir)
    echo $dirname running
    if [[ $dirname != "backup"* ]]; then
        echo "$dirname is good" >&2
    else
        echo "$dirname ignored" >&2
    fi
}
export -f task

parallel -j8 task ::: */ >running.out 2>done.out
echo all done

You will end up with two files running.out and done.out.
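If you want a single file like the serial run produced, you can merge done.out afterwards; since completion order under parallel is nondeterministic, sort it. A sketch (the stand-in done.out below is illustrative, standing in for the file produced by the run above):

```shell
# Stand-in for the done.out produced by the parallel run above,
# with lines in an arbitrary completion order.
printf '%s\n' '2 is good' 'backup_1 ignored' '1 is good' > done.out
# Completion order varies between runs, so sort for a stable result.
sort -o output.out done.out
cat output.out
```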

If you really just want to ignore the dirs called backup*:

task(){
    dir="$1"
    P=`pwd`
    dirname=$(basename $dir)
    echo $dirname running
    echo "$dirname is good" >&2
}
export -f task

parallel -j8 task '{=/backup/ and skip()=}' ::: */ >running.out 2>done.out
echo all done

Consider spending 20 minutes reading chapters 1+2 of https://doi.org/10.5281/zenodo.1146014. Your command line will love you for it.

Ole Tange