
I have a non-trivial Bash script taking roughly the following form:

# Initialization

<generate_data> | while read line; do

    # Run tests and filters on line

    if [ "$tests_pass" ]; then
        echo "$filtered_line"
    fi

done | sort <sort_option> | <consume_data>

# Finalization

Compared to the filter, the generator consumes minimal processing resources, and, of course, the sort operation cannot begin until all filtered data is available. As such, the filter, a cascade of several loops and conditionals written natively in Bash, is the processing bottleneck, and the single process running this loop consumes an entire core.

A useful objective would be to distribute this logic across several child processes that each run separate filter loops, and which, in turn, each consume blocks of lines from the generator, and which each produce output blocks concatenated into the sort operation. Functionality of this kind is available through tools such as GNU Parallel, but using them requires invoking an external command to run in the pipe.
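
For concreteness, such a restructuring would look roughly like the following, with the filter logic moved into a separate executable (here a hypothetical filter.sh that reads lines on standard input):

# Hypothetical external-command form; filter.sh would contain the
# per-line tests and filters from the loop above.
<generate_data> | parallel --pipe ./filter.sh | sort <sort_option> | <consume_data>

It is precisely this splitting of the filter logic into a separately maintained file that I would prefer to avoid.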

Is any convenient tool or feature available that makes the operations on the script distributable across multiple processes without disrupting the overall structure of the script? I am not aware of a Bash builtin feature, but one surely would be useful.

brainchild
  • would using xargs suffice? – gregory Dec 02 '19 at 03:34
  • https://unix.stackexchange.com/questions/103920/parallelize-a-bash-for-loop shows some examples of parallel loop processing in bash; hope it can be useful for you. – gzh Dec 02 '19 at 04:44
  • Thanks. I had read that topic (103920), but had not discovered anything that leads me to a solution to this problem, as constructed. Have you? – brainchild Dec 02 '19 at 04:51
  • Are you sure it's not forking? – that other guy Dec 02 '19 at 06:02
  • @gregory That would work similarly to GNU `parallel`, which the OP does not want (even though *`invoking an external command`* is not nearly as bad as running a loop in bash). @epl It might be possible to speed up your filter sufficiently without having to resort to parallel computations. With a minimal input and expected output someone might give you a solution here. – Socowi Dec 02 '19 at 08:32
  • @Socowi I am developing an optimized implementation of the filter logic. I require no particular help with this work. But the benefit from such an improvement applied by itself is far inferior to that from utilizing more hardware in parallel. The issue with *invoking an external command* is the lack of code manageability with respect to moving the filter logic into some command that can be called independently. – brainchild Dec 02 '19 at 09:24
  • *`such an improvement applied by itself is far inferior to that from utilizing more hardware in parallel`* I wouldn't count on it. Loops in bash are so slow, even in parallel they often cannot outrun other languages or even specialized tools. Example: To generate the numbers from 1 to 4'000'000 I compared the [following approaches](https://www.codepile.net/raw/e53Re7EN) on a quad core. One bash loop (16.1s); four bash loops in parallel (5.2s); one awk loop (0.9s); and `seq` (0.1s). Note that the loops here used only built-ins. If you repeatedly call external programs it's even worse. – Socowi Dec 02 '19 at 10:13
  • @Socowi I agree of course that Bash is slow, but porting is beyond the narrow scope of this topic. – brainchild Dec 02 '19 at 11:34
  • Sorry, I didn't want to urge you to use a completely different language. I just wanted to optimize your bash code for the filter. Often things can be written shorter and more efficiently using the right (bash) tools for the job. You only have to know what the right tools are – that's what I like so much about programming in bash. It's like a puzzle. – Socowi Dec 03 '19 at 01:14
  • How about using Redis? You could easily `LPUSH` the lines/blocks into a Redis list and start multiple processors that `BRPOP` blocks off the list and `LPUSH` results onto another list. Processor jobs could run in `bash`, Python or C++ across all machines in your network. – Mark Setchell Dec 05 '19 at 19:12
  • Example of Redis... https://stackoverflow.com/a/22220082/2836621 – Mark Setchell Dec 05 '19 at 19:16
  • Surely there are innumerable approaches in general, but again, the purpose of the question very narrowly relates to preserving existing code structure while adding utilization of parallel processes. The rationale for the topic is to understand limitations and capabilities offered by bash and related tools, not to brainstorm general strategies for parallel processing. Thanks. – brainchild Dec 07 '19 at 05:08
  • @epl I think it may be easier to answer your question if you strictly define what you mean by "bash and related tools". Without a strict definition of that I think you will get answers that include what the answerer thinks is "bash and related tools". E.g. I would find GNU Parallel a very related tool - it basically _only_ makes sense if run from a shell. But I feel you do not include GNU Parallel in your definition. – Ole Tange Dec 07 '19 at 05:40
  • Did you already experiment with `&` and waiting on `$!`? I usually store each result in an array, wait for all PID to finish, then run the sort/final process. I'll write an answer with a short example later today if you'd like. – Matthieu Dec 07 '19 at 07:33 (a rough sketch of this pattern appears just after the comments)
  • @OleTange Fair enough. I am primarily thinking of shell builtins and executable calls that are likely to be available in a \*Nix environment and that are commonly used to extend shell scripting beyond the native capabilities of the builtins. GNU Parallel would be included as a *related tool*. Databases, message queues, and specialized tools would not likely be included. Tools such as AWK, Perl, and sed may be included, but rewriting blocks of code in such a language, while plausible, was beyond the question's intention, which was rather of characterizing the limits and capabilities of Bash. – brainchild Dec 07 '19 at 08:55
  • @epl It only makes sense to include Perl if you are also allowing Perl programs: You cannot use Perl without writing a Perl program. GNU Parallel is a Perl program, and you can guarantee it will be available to your script by including it in the script with --embed. GNU Parallel is actively tested on a wide range of platforms, and it is seen as a bug if it is not working on a *Nix platform. If "| while .. done |" cannot be changed into "| parallel --pipe .. |" because it is seen as a rewrite of a block, then I think it is hard to give you a proper answer apart from "no, it cannot be done". – Ole Tange Dec 07 '19 at 18:46
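
A rough sketch of the background-job pattern mentioned a few comments above (the worker count, the file names, and the use of GNU split are illustrative; myfilter stands for the filter loop wrapped in a shell function):

nworkers=4
tmp=$(mktemp -d)
<generate_data> > "$tmp/all"                          # capture the generator output once
split -d -n l/"$nworkers" "$tmp/all" "$tmp/chunk."    # GNU split: line-balanced chunks
pids=()
for chunk in "$tmp"/chunk.[0-9][0-9]; do
    myfilter < "$chunk" > "$chunk.out" &              # one background worker per chunk
    pids+=("$!")
done
wait "${pids[@]}"                                     # block until every worker finishes
cat "$tmp"/chunk.[0-9][0-9].out | sort <sort_option> | <consume_data>
rm -r "$tmp"

Unlike a streaming solution such as parallel --pipe, this buffers the entire generator output to disk before any filtering begins, but it stays within bash plus coreutils.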

2 Answers


The issue with invoking an external command is the lack of code manageability with respect to moving the filter logic into some command that can be called independently.

If that is the reason for not using GNU Parallel, it sounds as if you are not aware of parallel --embed.

--embed exists precisely because people need GNU Parallel to live in the same file as the rest of their code.

[output from parallel --embed]

myfilter() {
    while read line; do
        # Run tests and filters on line
        if [ "$tests_pass" ]; then
            echo "$filtered_line"
        fi
    done
}
export -f myfilter

<generate_data> | parallel --pipe myfilter | sort <sort_option> | <consume_data>

The resulting script will run even if GNU Parallel is not installed.
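
Roughly, the workflow for --embed (following GNU Parallel's documentation; the file names here are only examples) is to generate the wrapper script once and append your own code to it:

parallel --embed > myscript.sh      # emits a script with GNU Parallel embedded
cat my_pipeline.sh >> myscript.sh   # my_pipeline.sh: hypothetical file holding the pipeline above
chmod +x myscript.sh
./myscript.sh                       # runs even on hosts without GNU Parallel installed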

Ole Tange
  • I was not aware of `--embed`, which appears to be very new, and not included even in recent distributions. But if the purpose of the option is to create scripts that run without the dependency, then would the current question not be completely unrelated? – brainchild Dec 07 '19 at 04:51

A useful objective would be to distribute this logic across several child processes that each run separate filter loops, and which, in turn, each consume blocks of lines from the generator, and which each produce output blocks concatenated into the sort operation. Functionality of this kind is available through tools such as GNU Parallel, but using them requires invoking an external command to run in the pipe.

You will rarely see bash scripts that do not invoke external commands. You even use sort in your pipe, and sort is an external command.

Is any convenient tool ...

Without your definition of 'convenient tool', that is impossible to answer. I would personally find parallel --pipe cmd convenient, but maybe it does not fit your definition.

... or feature available that makes the operations on the script distributable across multiple processes without disrupting the overall structure of the script? I am not aware of a Bash builtin feature, but one surely would be useful.

There is no Bash builtin; that is the primary reason why GNU Parallel has the --pipe option.

Using | parallel --pipe myfilter | seems to fit quite well with the overall structure of the script.
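
If throughput needs tuning, --pipe accepts the usual GNU Parallel options; the values below are only illustrative:

# Illustrative tuning: 4 parallel filter jobs, each fed blocks of roughly 1 MB of lines.
<generate_data> | parallel --pipe --jobs 4 --block 1M myfilter | sort <sort_option> | <consume_data>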

Ole Tange
  • I have no objection of course to calling executable processes such as sort. The intention of the comment you quoted was to draw attention to the perceived limitation that the command to be passed to Parallel must be an external executable, not a piece of the current script. Do you challenge this perception? – brainchild Dec 07 '19 at 08:27
  • @epl Since I do exactly that in the other answer (namely passing it a function defined in the same script - not an executable) I _do_ challenge this perception. You can even give it an alias if you use `env_parallel`. – Ole Tange Dec 07 '19 at 18:28
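
A minimal sketch of the env_parallel variant mentioned in the comment above, assuming env_parallel is installed alongside GNU Parallel (the activation line follows its documentation; adjust for your shell):

# Activate env_parallel for bash (typically done once, e.g. in ~/.bashrc).
. "$(which env_parallel.bash)"
myfilter() {
    # Run tests and filters on line (same loop body as in the question)
    while read line; do
        if [ "$tests_pass" ]; then
            echo "$filtered_line"
        fi
    done
}
# env_parallel exports functions, variables and aliases itself, so no export -f is needed.
<generate_data> | env_parallel --pipe myfilter | sort <sort_option> | <consume_data>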