
I have a flat file of ~740,000 patterns to grep against a ~30 GB directory

I can work either on the full ~30 GB directory or on an already-shortened ~3 GB file

and I want to do something like

analyse from(patterns) -> directory -> anyline with pattern >> patternfile

so I might use something like:

awk '
    BEGIN {
        # build one big alternation regex from all the patterns
        while ((getline line < "file1") > 0) pattern = pattern "|" line;
        pattern = substr(pattern, 2);   # strip the leading "|"
    }
    match($0, pattern) {for (i=1; i<=3; i++) {getline; print}}   # print the 3 lines after each match
' file2 > file3

but it gives only one big output file, not one per pattern found (each pattern would result in 7 to 15 lines of output in total). Or, in bash, something like this (where VB3 is already a much smaller test file):

while read -r ; do grep -i "$REPLY" VB3.txt > "OUT/$REPLY.outputparID.out" ; done < listeID.txt

and so on

but a rapid calculation gives me an estimate of more than 5 days to get results...

How can I do the same in at most 2-3 hours, or better? The difficulty here is that I need separate results per pattern, so the grep -F (-f) method cannot work.

francois
  • what proportion of the 30GB file is expected to be matches? ie. what is approximate expected total size of all the output files? 30GB? 10GB? 1GB? 1MB? – jhnc Sep 09 '22 at 19:35
    Don't use the word `pattern` in this context if you want a robust answer that does what you actually want, see [how-do-i-find-the-text-that-matches-a-pattern](https://stackoverflow.com/questions/65621325/how-do-i-find-the-text-that-matches-a-pattern) – Ed Morton Sep 09 '22 at 20:49
  • are you saying we have an option of scanning a 3GB file vs 30GB of directory/files? please update the question with 3-5 samples of your 740K patterns, a 15-20 line file (to scan; include some lines with single and multiple matches; include some lines with no matches), and also provide the expected results (corresponding to the sample inputs) – markp-fuso Sep 09 '22 at 23:04

2 Answers


You would want to scan the files once for all patterns. The approach: load the patterns into memory, check each line against every pattern, and accumulate results per pattern.

Something like this should work (untested script):

$ awk 'NR==FNR {pat[$0]=NR; next}    # first file: index the patterns
              {for (p in pat)
                 if ($0 ~ p) {
                    close(file);              # keep the number of open files bounded
                    file = pat[p] ".matches";
                    print >> file;            # append: ">" would truncate on reopen
                 }}' patterns.file otherfiles...

I suggest you take a small sample of the patterns and a small number of files and give it a try.
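
For example (the file names here are hypothetical), you could save the script above as match.awk and time a trial run on the first 100 patterns:

$ head -100 patterns.file > sample.pat
$ time awk -f match.awk sample.pat some_small_file.txt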

The filenames are the indices of the patterns used, so it should be easy to look up which pattern each one corresponds to. Since patterns may contain special characters, you may not want to use them as filenames directly.
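
For instance, recording the index-to-pattern mapping up front makes it easy to trace each N.matches file back to its pattern (untested; assumes one pattern per line):

$ awk '{print NR "\t" $0}' patterns.file > pattern.index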

Please post the timings if you can use this or a variation of it.

Another suggestion: opening and closing thousands of files may have a significant time cost. In that case, record the results in a single file, keyed by pattern (or pattern index). Once done, you can sort the results and split them into individual files per key.

Again, untested...

$ awk 'NR==FNR {pat[$0]=NR; next}    # first file: index the patterns
              {for (p in pat)
                 if ($0 ~ p) print pat[p] "\t" $0;   # emit "index <TAB> matching line"
              }' patterns.file otherfiles... | sort > omni.file

and then separate them:

$ awk -F'\t' 'prev!=$1 {close(file); prev=$1; file=$1".matches"}
                       {print $2 > file}' omni.file

This assumes there are no tabs in the results; otherwise, either find an unused character as the delimiter, or instead of printing $2, strip the key and print the rest (e.g. set $1 to the empty string and rebuild $0, or sub() the leading key away).
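
A sketch of that variant (untested): instead of printing $2, sub() away only the leading key, so any tabs inside the matched lines survive:

$ awk -F'\t' 'prev!=$1 {close(file); prev=$1; file=$1".matches"}
                       {sub(/^[^\t]*\t/, ""); print > file}' omni.file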

karakfa

Inspired by @karakfa's "Another suggestion", something like this might work well for you (untested):

sep=$'\f'
grep -hf patterns directory/* |
awk -v OFS="$sep" '
    NR==FNR {
        pats[$0]
        next
    }
    {
        for ( pat in pats ) {
            if ( $0 ~ pat ) {
                print pat, $0
            }
        }
    }
' patterns - |
sort -t "$sep" |
awk -F "$sep" '
    $1 != prev {
        close(out)
        out = $1 ".txt"
        prev = $1
    }
    {
        sub("^[^" FS "]*" FS, "")   # strip the leading "pattern<sep>", keeping just the matched line
        print > out
    }
'

The above assumes none of your "patterns" contain a form feed character (if they can, set sep to some other character; e.g. use \n for sep and \0 for RS/ORS, plus sort -z, if your tools support NUL terminator characters). It also assumes that by "pattern" you mean "partial line regexp"; if you mean something else, change the regexp parts to use whatever other form of pattern matching you need.
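
For reference, a rough sketch of that NUL-terminated variant (untested; assumes GNU awk for RS="\0" and GNU sort for -z):

sep=$'\n'
grep -hf patterns directory/* |
awk -v OFS="$sep" '
    BEGIN { ORS = "\0" }    # write NUL-terminated records: pattern, newline, matched line, NUL
    NR==FNR { pats[$0]; next }
    { for ( pat in pats ) if ( $0 ~ pat ) print pat, $0 }
' patterns - |
sort -z |
awk -F "$sep" '
    BEGIN { RS = "\0" }     # read the NUL-terminated records back
    $1 != prev { close(out); out = $1 ".txt"; prev = $1 }
    { print $2 > out }      # $2 is the matched line; grep output cannot contain embedded newlines
'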

The first grep is there so you only loop through the patterns for lines that you already know match at least one of them.

Ed Morton