Your solution may actually be slow because it creates 50,000 processes, each reading the 500-line pattern_file.
Another "pure bash & unix utils" solution could be to let grep do what it does best and match the lookup strings against your pattern_file directly. So use grep to find the matching lines and the parts that actually match.
I use word matching here. Remove the -w switch from the grep line to turn it off and get the initial behavior described in your example.
The output is not yet redirected to result_file.csv, which is easy to add later 8)
#!/bin/bash
# open pattern_file read-only on file descriptor 3
exec 3< pattern_file
# declare and initialize integer variables
declare -i linenr
declare -i pnr=0
# loop for reading from the grep process
#
# grep process creates following output:
#   <linenumber>:<match>
# where linenumber is the number of the matching line in pattern_file
# and match is the actual matching word (grep -w) as found in pattern_file
# grep output is piped through sed to actually get
#   <linenumber> <match>
while read -r linenr match ; do
    # skip lines from pattern_file until we have read the line
    # that contained the match
    # (( )) compares numerically; [[ a > b ]] would compare as strings
    while (( linenr > pnr )) ; do
        IFS= read -r -u 3 pline
        pnr+=1
    done
    # echo match and line from pattern_file
    echo "$match, $pline"
done < <( grep -i -w -o -n -f lookup_file pattern_file | sed -e 's,:, ,' )
# close pattern_file
exec 3<&-
The result is
sun, The sun is shining
shining, The sun is shining
beautiful, It is a beautiful day!
for the example given. Attention: the match is now the exact text found in pattern_file, with its case preserved. So this does not result in Sun, ... but in sun, ...
The result is a script that reads pattern_file once and uses a single grep, which in the best case reads pattern_file and lookup_file once each, depending on the actual implementation.
It only starts two additional processes: grep and sed. (If needed, sed can be replaced by some bash substitution within the outer loop.)
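The sed replacement mentioned above could look roughly like the sketch below. It assumes grep's <linenumber>:<match> lines arrive on stdin; split_matches is a hypothetical helper name and not part of the original script.

```shell
#!/bin/bash
# Sketch: split grep's "<linenumber>:<match>" lines without sed.
# split_matches is a hypothetical helper name, not part of the script above.
split_matches() {
    local line linenr match
    while IFS= read -r line ; do
        linenr=${line%%:*}   # text before the first colon
        match=${line#*:}     # text after the first colon
        echo "$linenr $match"
    done
}

# sample input in the format produced by grep -o -n
printf '%s\n' '1:sun' '1:shining' '2:beautiful' | split_matches
```

In the real script the two parameter expansions would simply go at the top of the outer while loop, which saves the sed process.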
I did not try it with a 50,000-line lookup_file and a 500-line pattern_file, though. But I think it may be as fast as grep can be.
As long as grep can keep the lookup_file in memory it may be reasonably fast. (Who knows)
No matter whether it solves your problem, I would be interested in how it performs compared to your initial script, since I lack nice test files.
If grep -f lookup_file uses too much memory (as you mentioned in a comment before), one solution may be to split the lookup_file into portions that do fit into memory and run the script more than once, or to use more than one machine, run all parts on those machines, and collect and concatenate the results. As long as the lookup_files do not contain dupes, you can concatenate the results without checking for dupes. If sorting matters, you can sort each single result and then merge them quite fast using sort -m.
Splitting up the lookup_file should not affect runtimes dramatically as long as you split it only once and then rerun the script, since your pattern_file may be small enough, at 500 lines, to stay in the memory cache anyway. The same may be true for the lookup_file if you use more than one machine: its parts may just stay in memory on every machine.
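The sort-then-merge step for the partial results might be sketched like this. The function name merge_results and the part*.csv file names are made up for illustration, and LC_ALL=C is only an assumption to keep the ordering predictable.

```shell
#!/bin/bash
# Sketch: sort each partial result file, then merge them with sort -m.
# merge_results and the part*.csv names are hypothetical.
merge_results() {
    local out=$1 f
    shift
    for f in "$@" ; do
        LC_ALL=C sort -o "$f" "$f"   # sort every part in place
    done
    LC_ALL=C sort -m "$@" > "$out"   # merge the already-sorted parts
}

# demo with two small partial results in a temporary directory
tmp=$(mktemp -d)
printf '%s\n' 'sun, The sun is shining' 'beautiful, It is a beautiful day!' > "$tmp/part1.csv"
printf '%s\n' 'shining, The sun is shining' > "$tmp/part2.csv"
merge_results "$tmp/result_file.csv" "$tmp/part1.csv" "$tmp/part2.csv"
cat "$tmp/result_file.csv"
rm -r "$tmp"
```

The lookup_file itself could be cut into pieces with something like split -l, using a chunk size that fits into memory on your machine.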
EDIT:
As pointed out in my comment, this will not work out of the box for overlapping search strings, since grep -f seems to return only the longest match and will not rematch. So if lookup_file contains
Sun
Shining
is
S
the result will be
sun, The sun is shining
is, The sun is shining
shining, The sun is shining
and not
sun, The sun is shining
is, The sun is shining
shining, The sun is shining
s, The sun is shining
s, The sun is shining
s, The sun is shining
So all the matches of s (it matches three times) are missing.
In fact, this is another issue with this solution: if a string is found twice in a line, it will be matched twice and identical lines will be returned. These can be removed with uniq, after sorting, since uniq only drops adjacent duplicates.
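Since uniq only removes adjacent duplicates and the repeated lines here may be separated by other matches from the same line, an order-preserving alternative is a small awk filter. This swaps awk in for uniq and is only a sketch; dedupe_results is a hypothetical name.

```shell
#!/bin/bash
# Sketch: drop repeated result lines while keeping the original order.
# dedupe_results is a hypothetical helper name.
dedupe_results() {
    # awk prints a line only the first time it is seen
    awk '!seen[$0]++'
}

# sample result lines where "the" was matched twice in the same line
printf '%s\n' \
    'the, the sun and the moon' \
    'sun, the sun and the moon' \
    'the, the sun and the moon' | dedupe_results
```

If the order of the results does not matter, piping them through sort -u should do the same job.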
Possible workaround: split the lookup_file by the string length of the search strings. This will decrease the maximum memory needed for a run of grep but also slow down the whole thing a little bit. But you can then search in parallel (and may want to check grep's --mmap option if doing that on the same server).
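A rough sketch of that workaround follows. The helper names split_by_length and grep_parts and the lookup_len_ file prefix are made up for illustration, and I have not verified that same-length groups avoid every overlap case.

```shell
#!/bin/bash
# Sketch: split lookup strings by length, then grep each part in parallel.
# split_by_length, grep_parts and the lookup_len_ prefix are hypothetical.
split_by_length() {
    # $1 = lookup file, $2 = prefix for the per-length output files
    # empty lines are skipped, since an empty pattern would match everything
    awk -v prefix="$2" 'length($0) > 0 { out = prefix length($0); print > out }' "$1"
}

grep_parts() {
    # $1 = pattern file, remaining arguments = lookup part files
    local pattern_file=$1 part
    shift
    for part in "$@" ; do
        # one background grep per part, results go to <part>.out
        grep -i -w -o -n -f "$part" "$pattern_file" > "$part.out" &
    done
    wait   # block until every background grep has finished
}

# demo with the example data in a temporary directory
tmp=$(mktemp -d)
printf '%s\n' 'The sun is shining' 'It is a beautiful day!' > "$tmp/pattern_file"
printf '%s\n' 'Sun' 'Shining' 'is' 'S' > "$tmp/lookup_file"
split_by_length "$tmp/lookup_file" "$tmp/lookup_len_"
grep_parts "$tmp/pattern_file" "$tmp"/lookup_len_*
cat "$tmp"/lookup_len_*.out
rm -r "$tmp"
```

Each .out file has the same <linenumber>:<match> format as before, so it can be fed through the loop of the main script one part at a time.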