I'm writing a script in the form

while read LINE 
do
    [[ $LINE =~ ^headertag1 ]] && function1 && continue
    [[ $LINE =~ ^headertag2 ]] && function2 && continue
    ...
done < filename

As the number of tags increases, I will be doing too many checks per line. I can try sorting the common tags higher up, but I don't think that solves the fundamental issue. I'm not a software engineer. Are there programming concepts/methods that can improve this situation?

user121392
  • Can you generalize your functions `function1` and `function2`? If so, probably. Would be nice to know what varies between these functions. – stephanmg Nov 08 '19 at 09:08
  • @stephanmg I will be parsing for information with regexes, usually in the form <<< $LINE | perl -pe 'regex' – user121392 Nov 08 '19 at 09:11
  • Are the functions completely different? Can you show examples for both functions? And how many functions you expect? – stephanmg Nov 08 '19 at 09:12
  • @stephanmg Yes, they are very different, because every line for a tag is unique to that tag. E.g. in the case of headertag1, it could be perl -pe 's/hello bye//'; for headertag2, it could be perl -pe 's/apples (oranges )bananas /$1 are delicious/' – user121392 Nov 08 '19 at 09:18
  • I see. How about defining a map? – stephanmg Nov 08 '19 at 09:19
  • @stephanmg You mean associative arrays? I'm not sure if I follow. – user121392 Nov 08 '19 at 09:21
  • Well, in your case, you know the following: Tag1 -> Some Regex, Tag2 -> Some Other Regex, Tag 3 -> Yet another regex. Is this correct? – stephanmg Nov 08 '19 at 09:23
  • @stephanmg That's correct. – user121392 Nov 08 '19 at 09:26

3 Answers

Yes. For two tags you can first find the longest common prefix of both (people were wondering how to do that in Bash here: Longest common prefix of two strings in bash), check whether the line starts with that prefix, and only then, after stripping it from both tags and from the line, check whether the line starts with the rest of either tag.
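
For illustration, a minimal Bash helper along those lines (the function name is made up here, not from the linked question):

common_prefix() {
    # Walk both strings together and stop at the first mismatch;
    # everything before it is the longest common prefix.
    local a=$1 b=$2 i
    for ((i = 0; i < ${#a} && i < ${#b}; i++)); do
        [[ ${a:i:1} != "${b:i:1}" ]] && break
    done
    printf '%s\n' "${a:0:i}"
}

common_prefix headertag1 headertag2   # prints: headertag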

For more than two, you need to build a trie, also known as a prefix tree: https://en.wikipedia.org/wiki/Trie

That Wikipedia article says

For the space-optimized presentation of prefix tree, see compact prefix tree.

which applies here: with tags sharing long common prefixes, a compact prefix tree is exactly what you will end up with.

Since Bash doesn't have multidimensional associative arrays, you will have to either consider the strategies in https://en.wikipedia.org/wiki/Trie#Implementation_strategies or embed some other scripting language, like Perl or Python, or GNU Awk (gawk), which, unlike standard Awk, has multidimensional associative arrays.
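
As a rough sketch of one such strategy (illustrative names, not from the article): emulate the trie's child pointers with a single Bash associative array whose keys combine a node id and an edge character.

declare -A next_node   # key "node,char" -> child node id
declare -A handler     # key: node id    -> function to call there
node_count=0

trie_insert() {   # usage: trie_insert TAG FUNCTION_NAME
    local tag=$1 fn=$2 node=0 c i
    for ((i = 0; i < ${#tag}; i++)); do
        c=${tag:i:1}
        # Create the child node the first time this edge is walked.
        [[ -z ${next_node["$node,$c"]:-} ]] && next_node["$node,$c"]=$((++node_count))
        node=${next_node["$node,$c"]}
    done
    handler[$node]=$fn
}

trie_dispatch() {   # usage: trie_dispatch "$LINE"
    local line=$1 node=0 c i fn=
    for ((i = 0; i < ${#line}; i++)); do
        c=${line:i:1}
        node=${next_node["$node,$c"]:-}
        [[ -z $node ]] && break          # no tag continues here
        # Remember the longest tag matched so far.
        [[ -n ${handler[$node]:-} ]] && fn=${handler[$node]}
    done
    [[ -n $fn ]] && "$fn"
}

trie_insert headertag1 function1
trie_insert headertag2 function2
trie_dispatch "$LINE"   # one pass over the line, then one function call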

Using the optimization of Bash's associative-array implementation

As suggested in the comments, we can extract just the tag with a simpler regex and use it as the key into an associative array, which Bash implements as a hash table (how well that suits our needs can be investigated in the Bash sources).

This requires knowing what delimits the tag: for example, if the tag is always immediately followed by a : and never contains one, we can use a regex like:

[[ $LINE =~ ^([^:]*): ]] && "${DICTIONARY_OF_FUNCTIONS["${BASH_REMATCH[1]}"]}"
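
For completeness (the answer leaves it implicit), DICTIONARY_OF_FUNCTIONS would be declared as an associative array mapping each tag to its handler function, along these lines:

declare -A DICTIONARY_OF_FUNCTIONS=(
    [headertag1]=function1
    [headertag2]=function2
)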

Or: using Bash's function table

If all your tags match something like /[a-z][a-z0-9]+/, or are otherwise acceptable inside Bash function names, and are delimited as in the associative-array method above, then you can interpolate the tag directly into a function name:

function the_function_for_tag_headertag1() {
    echo "hey it's the first one"
}
[[ $LINE =~ ^([^:]*): ]] && {
    func_name="the_function_for_tag_${BASH_REMATCH[1]}"
    # Call the function only if one by that name is actually defined.
    type "$func_name" &>/dev/null && "$func_name"
}
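
Tying this back to the loop in the question, the dispatch could replace the whole chain of tests like this (a sketch; the : delimiter is the same assumption as above):

while IFS= read -r LINE; do
    if [[ $LINE =~ ^([^:]*): ]]; then
        func_name="the_function_for_tag_${BASH_REMATCH[1]}"
        type "$func_name" &>/dev/null && "$func_name"
    fi
done < filename
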
Mika Feiler

The test that you perform on each tag

    [[ $LINE =~ ^headertag1 ]] && function1 && continue

is extremely cheap: an in-memory regex match. Most likely it will take a fraction of the I/O time associated with reading LINE (from a file or another process). Unless you are performing the test a very large number of times, this implementation is reasonable.

A note about style: if all patterns are prefix matches (or other similarly simple constructs), consider using a Bash case statement:

case "$LINE" in
   header1*) function1 ;;
   header2*) function2 ;;
   ...
esac

This will make the code more elegant, but it will not change the performance: both the regex and the wildcard pattern are cheap to match.

dash-o
  • While the test is extremely cheap, the file can be very long. If there is a better way of approaching it, I'll take it. – user121392 Nov 08 '19 at 09:42
  • @user121392 I think you might be overestimating the cost of the regex/pattern match. This processing is dominated by (1) the cost of executing function1/function2 - external perl scripts - and (2) potentially being IO bound. I do not think you can squeeze any meaningful speedup out of the pattern matching, regardless of the number of lines. Unless there are 1000s of patterns :-) – dash-o Nov 08 '19 at 12:16
  • It would help if you could add information to the question about the problem - e.g., number of lines, number of patterns, processing time for each function*, etc. It is hard to help without those details. – dash-o Nov 08 '19 at 12:17

Not sure here, but if you are looking to tidy up your code and are tired of adding these guards repetitively, then maybe this idea will help:

#!/bin/bash

declare -A tags   # associative array: tag -> regex for that tag
tags[tag1]="some regex1"
tags[tag2]="some regex2"
tags[tag3]="some regex3"

function action() {
  # For demonstration this just prints the command that would run;
  # $1 is the regex belonging to the matched tag.
  echo "perl -pe '$1' other-file.txt"
}

while read -r LINE; do
  for tag in "${!tags[@]}"; do
    [[ $LINE =~ ^$tag ]] && action "${tags[$tag]}"
  done
done < filename

Not sure if the OP is asking something like this.

stephanmg
  • The solution by Michal using BASH_REMATCH as the key for associative arrays with your recommendation fixed the problem. Thank you. – user121392 Nov 08 '19 at 10:07