The multiple grep invocations in extract.sh are probably the main bottleneck here. An obvious optimization is to read each file only once and then print a summary in the order you want. As an added benefit, the whole report for a file gets written as a single block, which should cut down on interleaved output when several copies run in parallel, even if it doesn't prevent it completely. Still, here's my attempt:
#!/bin/sh
# For each file: decompress it once, collect the matches for every pattern,
# and print them as one block per file from Perl's END section.
for f; do
  zcat "$f" |
    perl -ne '
      /(pattern1)/ && push @pat1, $1;
      /(pattern2)/ && push @pat2, $1;
      # ...
      END {
        # the '"'"'"$f"'"'"' trick splices the shell variable into the Perl code
        print "##### '"$f"'\n";
        print join ("\n", @pat1), "\n";
        print join ("\n", @pat2), "\n";
        # ...
        print "#### '"$f"'\n";
      }'
done
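Save that as extract.sh and make it executable; you can then point it at any number of compressed logs (the path below is just a placeholder):

chmod +x extract.sh
./extract.sh /path/to/logs/*.gz > report.txt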
Doing this in awk instead of Perl might be slightly more efficient, but since you are using grep -P I figure it's useful to be able to keep the same regex syntax.
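For the record, a rough awk equivalent could look something like the sketch below (untested against your data). Keep in mind that plain awk only speaks POSIX ERE, so any -P-specific constructs such as lookarounds won't carry over, and match() gives you the whole match rather than a capture group:

#!/bin/sh
# Same idea in awk: one pass per file, matches buffered until END.
# pattern1/pattern2 are placeholders, as in the Perl version above.
for f; do
  zcat "$f" |
    awk -v fname="$f" '
      match($0, /pattern1/) { pat1 = pat1 substr($0, RSTART, RLENGTH) "\n" }
      match($0, /pattern2/) { pat2 = pat2 substr($0, RSTART, RLENGTH) "\n" }
      END {
        printf "##### %s\n", fname
        printf "%s", pat1
        printf "%s", pat2
        printf "#### %s\n", fname
      }'
done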
The script accepts multiple .gz files as input, so you can use find -exec extract.sh {} + or xargs to launch a number of parallel processes. With xargs you can try to strike a balance between sequential and parallel work by feeding each new process, say, 100 to 500 files in one batch: you save on the number of new processes, but lose some parallelization. Some experimentation should reveal where the balance lies, but this is the point where I would just pull a number out of my hat and see if it's good enough already.
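For example, with GNU or BSD xargs (the directory, the batch size of 200 and the 4 parallel jobs are numbers pulled out of my hat, as per above):

# Feed extract.sh batches of up to 200 files, at most 4 batches in parallel.
# Assumes the script sits in the current directory; adjust paths to taste.
find /path/to/logs -name '*.gz' -print0 |
  xargs -0 -n 200 -P 4 ./extract.sh > report.txt

Note that with several processes writing to the same report, the per-file blocks can still interleave, as mentioned above.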
Granted, if your input files are small enough, the repeated grep invocations will run straight out of the disk cache, and may turn out to be faster than the overhead of starting up Perl for every file.