
I have a while loop in Bash handled like this:

while IFS=$'\t' read -r -a line;
do
    myprogram ${line[0]} ${line[1]} ${line[0]}_vs_${line[1]}.result;
done < fileinput

It reads from a file with this structure, for reference:

foo   bar
baz   foobar

and so on (tab-delimited).

I would like to parallelize this loop (since the entries are a lot and processing can be slow) using GNU parallel, however the examples are not clear on how I would assign each line to the array, like I do here.

What would be a possible solution (alternatives to GNU parallel work as well)?

Einar

3 Answers


I like @chepner's hack, and it turns out not to be too tricky to get similar behaviour while limiting the number of parallel executions:

while IFS=$'\t' read -r f1 f2;
do
    myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &

    # At most as number of CPU cores
    [ $( jobs | wc -l ) -ge $( nproc ) ] && wait
done < fileinput

wait

This limits execution to at most the number of CPU cores present on the system. You can easily vary that by replacing `$( nproc )` with the desired amount.

Meanwhile, you should understand that this is not a fair distribution: it does not start a new job as soon as one finishes. Instead, it starts the maximum number of jobs and then waits for all of them to complete before starting the next batch. So overall throughput may be somewhat lower than with parallel, especially if your program's run time varies over a wide range. If each invocation takes roughly the same time, the total time should also be roughly equivalent.
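If your Bash is 4.3 or newer, `wait -n` (wait for any one job to finish) gives the fairer distribution described above: a new job starts as soon as any slot frees up, instead of waiting for the whole batch. A sketch, with `myprogram` standing in for the command from the question:

```shell
#!/usr/bin/env bash
# Sketch: keep at most $(nproc) jobs running; refill as soon as one exits.
# Requires Bash >= 4.3 for `wait -n`. "myprogram" is the question's command.
max_jobs=$(nproc)

while IFS=$'\t' read -r f1 f2; do
    # If we are at the limit, wait for *any one* job to finish (not all).
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
        wait -n
    done
    myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &
done < fileinput
wait   # wait for the remaining stragglers
```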

Hubbitus
  • I like this solution, but instead of `wait` I used `while [ $( jobs | wc -l ) -ge $( nproc ) ]; do sleep 3; done` – Syco Oct 01 '20 at 14:01
  • Are you speaking about the first `wait`? Yes, that may make sense – Hubbitus Oct 02 '20 at 14:12

From https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Use-a-table-as-input:

"""
Content of table_file.tsv:

foo<TAB>bar
baz <TAB> quux

To run:

cmd -o bar -i foo
cmd -o quux -i baz

you can run:

parallel -a table_file.tsv --colsep '\t' cmd -o {2} -i {1}

"""

So in your case it will be:

cat fileinput | parallel --colsep '\t' myprogram {1} {2} {1}_vs_{2}.result
Alessandro Cosentino
Ole Tange

parallel isn't strictly necessary here; just start all the processes in the background, then wait for them to complete. The array is also unnecessary, as you can give read more than one variable to populate:

while IFS=$'\t' read -r f1 f2;
do
    myprogram "$f1" "$f2" "${f1}_vs_${f2}.result" &
done < fileinput
wait

This does start a single job for every item in your list, whereas parallel can limit the number of jobs running at once. You can accomplish the same in bash, but it's tricky.
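If GNU parallel is unavailable, `xargs -P` is another way to cap concurrent jobs (a sketch; it assumes GNU or BSD xargs and that the fields never contain whitespace or quote characters, since `xargs` tokenizes on both tabs and spaces, so `-n 2` consumes one line's pair per invocation; `myprogram` is the command from the question):

```shell
# Sketch: cap concurrency with xargs -P instead of GNU parallel.
# -n 2 passes two tokens (one line's pair) per invocation;
# -P runs up to $(nproc) invocations at once.
< fileinput xargs -n 2 -P "$(nproc)" \
    sh -c 'myprogram "$1" "$2" "${1}_vs_${2}.result"' _
```

The trailing `_` fills `$0` of the inline `sh -c` script so that `$1` and `$2` receive the two fields.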

chepner
  • Is there a way to do this with a max process cap? Otherwise running it on large input blows up -- edit, nevermind see Hubbitus's answer – nmr Oct 05 '17 at 18:49
  • Is it possible to have several commands together? Like, where would I put "&" in the following to get a process for each line in my file? ``` while read l; do myprogram $l; if [ "$?" -eq 0 ]; then X="Success"; else X="failure"; fi; result="${result}\n${l}: ${X}"; done < file.txt; wait; echo result ``` https://pastebin.com/snBQKmhH – Olsgaard Sep 30 '20 at 12:18