A recursive solution: iterate over the files one by one. For each file, check whether it matches the first pattern, stopping at the first match (`-m1`); only if it matched the first pattern, search for the second pattern, and so on:
#!/bin/bash
# note: patterns must not contain whitespace, since they are passed
# as a single word-split string
patterns="$@"

# recurse through the patterns; print the file only if every pattern matches
fileMatchesAllNames () {
    local file=$1
    if [[ $# -eq 1 ]]
    then
        echo "$file"
    else
        shift
        local pattern=$1
        shift
        # -m1: stop reading after the first match; -q: report via exit status only
        grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" "$@"
    fi
}

for file in *
do
    test -f "$file" && fileMatchesAllNames "$file" $patterns
done
Usage:
./allfilter.sh cat filter java
test.sh
Searches the current directory for the tokens "cat", "filter", and "java"; they were found only in "test.sh".
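As a quick sanity check of the two grep flags the script relies on (a minimal sketch; test.txt is a made-up file name):

```shell
# -m1 stops reading the file after the first matching line,
# -q suppresses output and reports only via the exit status
printf 'cat\nfilter\njava\n' > test.txt
grep -m1 -q cat test.txt && echo "matched"      # prints: matched
grep -m1 -q dog test.txt || echo "no match"     # prints: no match
```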
So in the worst case grep is invoked many times per file (each of the first N-1 patterns is found only on the last line of a file, and only the N-th pattern is missing).
But with an informed ordering (rare matches first, early matches first) where possible, the solution should be reasonably fast, since many files are either abandoned early because they don't match the first keyword, or accepted early because they match a keyword close to the top.
Example: searching scala source files which contain tailrec (somewhat rarely used), mutable (rarely used, but if present, close to the top among the import statements), main (rarely used, often not close to the top), and println (often used, unpredictable position), you would order them:
./allfilter.sh mutable tailrec main println
Performance:
ls *.scala | wc
89 89 2030
Across these 89 scala files, the keyword distribution is:
for keyword in mutable tailrec main println; do grep -m 1 $keyword *.scala | wc -l ; done
16
34
41
71
Searching for them with a slightly modified version of the script, which allows a file pattern as the first argument, takes about 0.2 s:
time ./allfilter.sh "*.scala" mutable tailrec main println
Filepattern: *.scala Patterns: mutable tailrec main println
aoc21-2017-12-22_00:16:21.scala
aoc25.scala
CondenseString.scala
Partition.scala
StringCondense.scala
real 0m0.216s
user 0m0.024s
sys 0m0.028s
across close to 15,000 lines of code:
cat *.scala | wc
14913 81614 610893
Update:
After reading in the comments to the question that we might be talking about thousands of patterns, passing them as arguments doesn't seem a clever idea; it is better to read them from a file and pass the file name as an argument, perhaps for the list of files to filter, too:
#!/bin/bash
filelist="$1"
patternfile="$2"
patterns="$(< "$patternfile")"

fileMatchesAllNames () {
    local file=$1
    if [[ $# -eq 1 ]]
    then
        echo "$file"
    else
        shift
        local pattern=$1
        shift
        grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" "$@"
    fi
}

echo -e "File list: $filelist\tPatterns: $patterns"
for file in $(< "$filelist")
do
    test -f "$file" && fileMatchesAllNames "$file" $patterns
done
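To see the whole pipeline work end to end, here is a self-contained sketch that inlines the matching function and runs it on two generated sample files (all the file names and contents below are invented for illustration):

```shell
#!/bin/bash
# same recursive matcher as in the script above
fileMatchesAllNames () {
    local file=$1
    if [[ $# -eq 1 ]]; then
        echo "$file"
    else
        shift
        local pattern=$1
        shift
        grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" "$@"
    fi
}

# scratch data: b.scala contains all four keywords, a.scala none of them
cd "$(mktemp -d)"
printf 'object A { val x = 1 }\n' > a.scala
printf 'import scala.collection.mutable\nimport scala.annotation.tailrec\ndef main = println()\n' > b.scala
ls *.scala > files.lst
printf '%s\n' mutable tailrec main println > patterns.lst

patterns="$(< patterns.lst)"
for file in $(< files.lst); do
    test -f "$file" && fileMatchesAllNames "$file" $patterns
done
# prints: b.scala
```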
If the number and length of patterns/files exceeds the limits of argument passing, the list of patterns could be split into several pattern files and processed in a loop (for example, 20 pattern files):
for i in {1..20}
do
    ./allfilter2.sh file.$i.lst pattern.$i.lst > file.$((i+1)).lst
done
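The numbered pattern files for that loop could be generated with split; a rough sketch (the chunk size, the patterns.all input file, and the chunk. prefix are all arbitrary choices here, not part of the script above):

```shell
# assume all patterns sit in one big file, one pattern per line
cd "$(mktemp -d)"
printf 'p%s\n' {1..100} > patterns.all

# split into chunks of 5 lines; split names its output xaa-style (chunk.aa, ...)
split -l 5 patterns.all chunk.

# rename the chunks into the numbered pattern.$i.lst scheme the loop expects
i=1
for f in chunk.*; do
    mv "$f" "pattern.$i.lst"
    i=$((i + 1))
done

ls > file.1.lst             # seed the first file list with the current directory
ls pattern.*.lst | wc -l    # prints: 20
```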