Global Maxima
Efficient bash scripts typically require creativity; you rarely arrive at one by incrementally improving a naive solution.
The most important part of finding efficient solutions is to know your data. Every restriction you can make allows optimizations. Some examples that can make a huge difference:
- The input is sorted or data in different files has the same order.
- The elements in a list are unique.
- One of the files to be processed is way bigger than the others.
- The symbol X never appears in the input or only appears at special places.
- The order of the output does not matter.
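As a small illustration of the first point: if two lists are known to be sorted, their common lines can be found with comm in a single pass instead of running one grep per line. This is only a sketch; the file names a.txt and b.txt and their contents are made up for the example.

```shell
# Hypothetical sample data; both files are already sorted.
tmp=$(mktemp -d)
printf '%s\n' apple banana cherry > "$tmp/a.txt"
printf '%s\n' banana cherry date  > "$tmp/b.txt"

# Naive: one grep process per line of a.txt; b.txt is re-read every time.
naive=$(while read -r line; do grep -Fx -- "$line" "$tmp/b.txt"; done < "$tmp/a.txt")

# Exploiting the sort order: a single pass over both files.
fast=$(comm -12 "$tmp/a.txt" "$tmp/b.txt")

printf '%s\n' "$fast"
rm -r "$tmp"
```

Both variants produce the same lines here, but comm only works because the inputs are sorted; on unsorted input it would silently miss matches.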
When I try to find an efficient solution, my first goal is to make it work without an explicit loop. For this, I need to know the available tools. Then comes the creative part of combining these tools. To me, this is like assembling a jigsaw puzzle without knowing the final picture. A typical mistake here is similar to the XY problem: After you have assembled some pieces, you might be fooled into thinking you know the final picture and search for a piece Y that does not exist in your toolbox. Frustrated, you implement Y yourself (typically by using a loop) and ruin the solution.
If there is no right piece for your current approach, either use a different approach or give up on bash and use a better scripting/programming language.
Local Maxima
Even though you might not be able to get the best solution by improving a bad solution, you still can improve it. For this you don't need to be very creative if you know some basic anti-patterns and their better alternatives. Here are some typical examples from your script:
Some of these might seem very small, but starting a new process is way more expensive than one might suppose. Inside a loop, the cost of starting a process is multiplied by the number of iterations.
Extract multiple fields from a line
Instead of calling cut for each individual field, use read to read them all at once:
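# Slow: a command substitution and a cut process for every field of every line
while read -r line; do
  field1=$(echo "$line" | cut -f1 -d" ")
  field2=$(echo "$line" | cut -f2 -d" ")
  ...
done < file

# Better: read splits the line itself
while read -r field1 field2 otherFields; do
  ...
done < file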
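The same pattern should carry over to other delimiters by setting IFS for the read call only. A minimal sketch with made-up colon-separated records (the names name, id, and rest are just for illustration):

```shell
# Hypothetical input format: name:id:comment
out=$(
  while IFS=: read -r name id rest; do
    # All three variables are filled by one builtin call; no cut is needed.
    printf '%s has id %s\n' "$name" "$id"
  done <<'EOF'
alice:1001:first user
bob:1002:second user
EOF
)
printf '%s\n' "$out"
```

Because IFS=: is written as a prefix to read, it only applies to that command and does not change word splitting in the rest of the script.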
Combinations of grep, sed, awk
Everything grep (in its basic form) can do, sed can do better. And everything sed can do, awk can do better. If you have a pipe of these tools you can combine them into a single call.
Some examples of (in your case) equivalent commands, one per line:
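# Remove the -- separator lines (three equivalent variants):
sed 's/^--$//g' | awk 'NF'
sed '/^--$/d'
grep -vFxe--

# Print the line before each match, the match, and the two lines after it (two equivalent variants):
grep -i -B1 -A2 "^$pSeq" | sed 's/^--$//g' | awk 'NF'
awk "/^$pSeq/"' {print last; c=3} c>0; {last=$0; c--}'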
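Before swapping one pipeline for another, it is worth checking on a small sample that both really produce the same output. The following sketch does that for the grep/sed/awk pipeline and the single awk call; the sample records and the value of pSeq are invented for the test.

```shell
# Made-up sample: two FASTA-like records separated by an unrelated line.
tmp=$(mktemp)
printf '%s\n' '>id1' ACGT q1 q2 filler '>id2' ACGA r1 r2 > "$tmp"
pSeq=ACG   # hypothetical search prefix

# Three-process pipeline (-B and -A are extensions found in GNU and BSD grep).
old=$(grep -i -B1 -A2 "^$pSeq" "$tmp" | sed 's/^--$//g' | awk 'NF')

# Single awk call doing the same job.
new=$(awk "/^$pSeq/"' {print last; c=3} c>0; {last=$0; c--}' "$tmp")

if [ "$old" = "$new" ]; then echo same; else echo different; fi
rm "$tmp"
```

Note that the two are only equivalent under the assumptions of this input: the awk version matches case-sensitively, and it would print lines twice if the context of two matches overlapped.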
Multiple grep on the same file
You want to read files at most once, especially if they are big. With grep -f you can search multiple patterns in a single run over one file. If you just wanted to get all matches, you would replace your entire loop with
grep -i -B1 -A2 -f <(cut -f2 -d' ' reference_file | sed 's/^/^/') \
a_very_big_file another_very_big_file
But since you have to store different matches in different files ... (see next point)
Know when to give up and switch to another language
Dynamic output files
Your loop generates multiple files. The typical command line utils like cut, grep and so on only generate one output. I know only one standard tool that generates a variable number of output files: split. But that does not filter based on values, but on position. Therefore, a non-loop solution for your problem seems unlikely. However, you can optimize the loop by rewriting it in a different language, e.g. awk.
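To show what value-based output files can look like in awk, here is a minimal sketch that routes each line to a file named after its first field. The input format and the .out suffix are assumptions made for the example.

```shell
# Made-up input: "<category> <value>" pairs.
tmp=$(mktemp -d)
printf '%s\n' 'a 1' 'b 2' 'a 3' > "$tmp/input"

# One awk process writes each value to a file chosen by the first field.
# The target is parenthesized because unparenthesized concatenation
# after > is not portable across awk implementations.
awk -v dir="$tmp" '{ print $2 > (dir "/" $1 ".out") }' "$tmp/input"

a_out=$(cat "$tmp/a.out")
b_out=$(cat "$tmp/b.out")
printf 'a.out: %s\nb.out: %s\n' "$a_out" "$b_out"
rm -r "$tmp"
```

Within one awk run, > truncates a file only when it is first opened, so later lines for the same target are appended.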
Loops in awk are faster ...
time awk 'BEGIN{for(i=0;i<1000000;++i) print i}' >/dev/null # takes 0.2s
time for ((i=0;i<1000000;++i)); do echo $i; done >/dev/null # takes 3.3s
seq 1000000 > 1M
time awk '{print}' 1M >/dev/null # takes 0.1s
time while read -r l; do echo "$l"; done <1M >/dev/null # takes 5.4s
... but the main speedup will come from something different. awk has everything you need built into it, so you don't have to start new processes. Also ... (see next point)
Iterate the biggest file
Reduce the number of times you have to read the biggest files. So instead of iterating reference_file and reading both big files over and over, iterate over the big files once while holding reference_file in memory.
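A common awk idiom for this is to load the small file into an array while FNR==NR holds and then stream the big file past it. The sketch below uses two tiny invented files to show the shape of it; the key/label layout is an assumption for the example.

```shell
tmp=$(mktemp -d)
# Hypothetical small lookup file: "<key> <label>"
printf '%s\n' 'k1 one' 'k3 three' > "$tmp/small"
# Hypothetical big file: "<key> <payload>"
printf '%s\n' 'k1 x' 'k2 y' 'k3 z' > "$tmp/big"

# FNR==NR is only true while the first file is read: load it into an array.
# Every later line comes from the big file, which is read exactly once.
joined=$(awk 'FNR==NR { label[$1]=$2; next }
              $1 in label { print $0, label[$1] }' "$tmp/small" "$tmp/big")

printf '%s\n' "$joined"
rm -r "$tmp"
```

Memory use grows with the small file only, so the size of the big file does not matter beyond the time needed to read it once.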
Final script
To replace your script, you can try the following awk script. This assumes that ...
- the filenames (first column) in reference_file are unique
- the two big files do not contain > except for the header
- the patterns (second column) in reference_file are not prefixes of each other. If this is not the case, simply remove the break.
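awk -v dir="$dir" '
  # First file (reference_file): remember output file name and pattern.
  FNR==NR {max++; file[max]=$1; pat[max]=$2; next}
  # Big files: one record per ">"-delimited entry, one field per line.
  {
    for (i=1;i<=max;i++)
      if ($2~"^"pat[i]) {
        printf ">%s", $0 > (dir "/" file[i])
        break
      }
  }' reference_file RS=\> FS=\\n a_very_big_file another_very_big_file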