
I am trying to match strings from one file against another file to fetch each matching line along with the previous line and the next 2 lines.

I could do this with grep on a small chunk, but it throws "memory exhausted" on the original data (200M lines of keys and a 2TB input source file).

grep --no-group-separator -A 2 -B 1 -f key source

sample key file

^CNACCCAAGGCTCATT  
^ANAGCGGCAACTCGCG  

I added the "^" to each line since the key is the first 16 characters of the line following the one starting with '@'.

The patterns are formed of the characters ATGCN, have length 16 and are random. There could be multiple matches in the source file against a pattern.

sample source file to search against

@A00354:427:HVYWLDSXY:1:1101:1036:1000 1:N:0:ATTACTTC  
CNACCCAAGGCTCATTCATTATATAGTGGAGGCGGAGAACTTTCCTCCGGTTTGCCTAACATGCCAGCTGTCGGTGTCAAAACCGGCGGATCTCGGGAAGGGGGTCCTGAACTGTGCGTCTTAGGTCGATGGTAATAGGAGACGGGGGAC  
+  
:#:FFFFFF:F,FFFFFFF:FFF,FF:FFFFFF,FFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,F:FFFFFFFFFFFFFF:F:F,:F:FFFFFFFFFFF:FFF  
@A00354:427:HVYWLDSXY:1:1101:1108:1000 1:N:0:ATTACTTC  
ANAGCGGCAACTCGCGGTTCCCCTACACATAGAAAACCTACGCCACATTATTGGCTAGGACGAGTGGTTCGTCTGCGTACGCAAGATTGTTGAGATCCACTATTGTCATTCAGTACTACGGTTCTTCTTATCTTGGTCGATCGTGTAAAA  
+  
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF  
@A00354:427:HVYWLDSXY:1:1101:1271:1000 1:N:0:ATTACTTC  
CNATCCCGTCTCGAGCCCGCCCCAATAGCAACAACAACAACAACAACAACAACAACAGCAACAACACCAGCAACACCAGCAACAACAGCAACAACAACAACAGCAACAACAACAACAACAACAACAACAACAACAACAACAACAACAAGA  
+  
F#FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  
@A00354:427:HVYWLDSXY:1:1101:1325:1000 1:N:0:ATTACTTC  
TNCGGTTCATAGGAATGTAGTCTTTGTAATTATGCGCAATTTCCAAACACTTCAAGGTTTTTTTGCAAATAAAACATTCAGGCCTCGTGTGTGCCGCTGCATCTTAGATCCAACGGCTCCTAGTTGCTCATATTCNACCCAAGGCTCATTAGGTGCTCCCCGTAGC  
+  
:#FFF:F,FFFFFFFFFFFF,:FFF::F,FFF,F:FFFFFFF:FFFF:FF:F:FFF:F:F:FFFFFFFF,FF,F:FF:FF::F,FFF:FFFFFF,:F::FFFFFFF:FF:FFFFF,FFFFFF,FFF:FFFFFFFFF,FFFF:FFFFFFF:  

Even if I split the key file, it is painstakingly slow.

Can it be done more efficiently using a Perl one-liner or awk?

The expected output would be

@A00354:427:HVYWLDSXY:1:1101:1036:1000 1:N:0:ATTACTTC  
CNACCCAAGGCTCATTCATTATATAGTGGAGGCGGAGAACTTTCCTCCGGTTTGCCTAACATGCCAGCTGTCGGTGTCAAAACCGGCGGATCTCGGGAAGGGGGTCCTGAACTGTGCGTCTTAGGTCGATGGTAATAGGAGACGGGGGAC  
+  
:#:FFFFFF:F,FFFFFFF:FFF,FF:FFFFFF,FFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,F:FFFFFFFFFFFFFF:F:F,:F:FFFFFFFFFFF:FFF  
@A00354:427:HVYWLDSXY:1:1101:1108:1000 1:N:0:ATTACTTC  
ANAGCGGCAACTCGCGGTTCCCCTACACATAGAAAACCTACGCCACATTATTGGCTAGGACGAGTGGTTCGTCTGCGTACGCAAGATTGTTGAGATCCACTATTGTCATTCAGTACTACGGTTCTTCTTATCTTGGTCGATCGTGTAAAA  
+  
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF

I saw code like

awk 'NR==FNR{a[$1]; next} {for (i in a) if (index($0, i)) print $1}' key source

which checks whether each entry in key is a substring of the source, but I couldn't get my head around making it check for a pattern (^CNACCCAAGGCTCATT) and fetch the previous and next lines.

Another way I tried and couldn't work out was: zcat key | match each line against source file > output

Maybe the slowness is because of my code; any alternative is much appreciated.

  • I'm in doubt if a script with perl or awk can exceed grep in efficiency. If grep is not fast enough, please consider `ripgrep` (`rg`) which is much faster than grep. – tshiono Mar 18 '21 at 06:20
  • Refactoring this to Perl could help, as it allows you to store a hash on disk (look for [`tie`](https://perldoc.perl.org/functions/tie)). I don't think it's going to be particularly elegant or fast, but probably offers a workaround for the memory exhaustion problem. – tripleee Mar 18 '21 at 07:06
  • @tripleee yes the pattern is the first 16 chars of the line after the line starting with '@'. if you could provide a working example that would be great as i am not good at coding – Philip Francis Mar 18 '21 at 07:47
  • Can you still say something about the distribution of the patterns (or "keys" as you call them)? Is it correct to assume that your alphabet has five distinct symbols (TCGA plus N)? Are the patterns randomly distributed across the 5^16 possible values or can they be generalized? Probably [edit] your question rather than hiding details down here in the comments. – tripleee Mar 18 '21 at 07:58

1 Answer


for (i in a) if (index($0, i)) would be immensely slow because you're looping 200,000,000 times per line of your "source" file (so 200M iterations for every line of a 2TB file!), and it would also produce incorrect output, as index($0, i) finds the target key anywhere on the source line rather than only at the start; it would have to be index($0, i) == 1 to match only at the start.
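
For reference, the anchored test could look something like the line below (assuming the ^s have already been stripped from the keys), but that's still one string comparison per key per input line, so still hopelessly slow at your sizes, and it still doesn't print the surrounding lines:

awk 'NR==FNR{a[$1]; next} {for (i in a) if (index($0, i) == 1) {print; break}}' key source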

This is how to do it in awk, after removing all of those ^s from the start of your "key" file lines: we're going to do an efficient hash lookup with strings, not a slow regexp comparison as would be required with grep, and we're going to do 1 hash lookup per line of "source" instead of the 200M string comparisons per line done by the awk script in your question:

$ cat tst.awk
NR==FNR { tgts[$1]; next }                      # 1st file (key): store every key
c && !(--c) { print p3 ORS p2 ORS p1 ORS $0 }   # countdown hit 0: print the 3 saved lines plus this one
{ key=substr($0,1,16); p3=p2; p2=p1; p1=$0 }    # remember the last 3 lines seen
key in tgts { c=2 }                             # this line starts with a key: fire 2 lines from now

$ awk -f tst.awk key source
@A00354:427:HVYWLDSXY:1:1101:1036:1000 1:N:0:ATTACTTC
CNACCCAAGGCTCATTCATTATATAGTGGAGGCGGAGAACTTTCCTCCGGTTTGCCTAACATGCCAGCTGTCGGTGTCAAAACCGGCGGATCTCGGGAAGGGGGTCCTGAACTGTGCGTCTTAGGTCGATGGTAATAGGAGACGGGGGAC
+
:#:FFFFFF:F,FFFFFFF:FFF,FF:FFFFFF,FFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,F:FFFFFFFFFFFFFF:F:F,:F:FFFFFFFFFFF:FFF
@A00354:427:HVYWLDSXY:1:1101:1108:1000 1:N:0:ATTACTTC
ANAGCGGCAACTCGCGGTTCCCCTACACATAGAAAACCTACGCCACATTATTGGCTAGGACGAGTGGTTCGTCTGCGTACGCAAGATTGTTGAGATCCACTATTGTCATTCAGTACTACGGTTCTTCTTATCTTGGTCGATCGTGTAAAA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF

See printing-with-sed-or-awk-a-line-following-a-matching-pattern/17914105#17914105 for more information on what c=2 and c && !(--c) are doing: it sets a countdown of lines, and the condition becomes true (so the associated action prints the saved lines) when that count reaches zero again.
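
As a toy illustration of that countdown idiom on made-up input (the action fires on the 2nd line after the match):

$ printf 'a\nb\nc\nd\ne\n' | awk 'c && !(--c){print "fired on: " $0} /b/{c=2}'
fired on: d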

If that exceeds available memory, let us know as another approach can look something like the following pseudo-code (I am not suggesting you do this in shell!):

sort keys
sort source by the 2nd line of each group, keeping the 4-line groups together
while !done; do
    read tgt < keys
    while read source_line; do
        key = substr(source_line,1,16)
        if key == tgt; then
            print source_line + context
        else if key > tgt; then
            break
        fi
    done < source
done

so the idea is that you don't read the next value from "key" until the current value from "source" is bigger than the one you were using. That would reduce memory usage to close to zero, but it does require both input files to be sorted.
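
Purely to illustrate that merge idea (a sketch, not something I've run on your data), the lookup step might look like the awk below, assuming keys.sorted is a hypothetical copy of your key file with the ^s removed and sorted with LC_ALL=C sort, and that source.sorted has already been rearranged into 4-line records ordered by the first 16 characters of their 2nd line (which is the hard part):

$ cat merge.awk
BEGIN {
    keyfile = "keys.sorted"                       # hypothetical name: sorted keys, no ^
    if ((getline tgt < keyfile) <= 0) exit        # no keys -> nothing to do
}
{
    rec[FNR % 4] = $0                             # buffer the current 4-line record
    if (FNR % 4 == 2) key = substr($0,1,16)       # key = first 16 chars of the sequence line
    if (FNR % 4 != 0) next                        # wait until the record is complete
    while (tgt < key)                             # advance keys until tgt >= key
        if ((getline tgt < keyfile) <= 0) exit    # keys exhausted -> done
    if (tgt == key)                               # match: print the whole 4-line record
        print rec[1] ORS rec[2] ORS rec[3] ORS rec[0]
}

$ awk -f merge.awk source.sorted

It reads each file just once and keeps only one key and one record in memory at a time, which is the point of the sorted-merge approach.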

Ed Morton
  • The first awk one-liner works beautifully. It doesn't output the 4th line though; I can work around that so it really doesn't matter anyway. There was no memory issue. :) @Ed Morton – Philip Francis Mar 19 '21 at 08:05
  • I didn't notice there was a 4th line, I thought it was just 3. I updated the script to print 4 lines. – Ed Morton Mar 19 '21 at 12:58