I am trying to match strings from one file against another file and fetch each matched line along with the previous line and the next 2 lines.
I could do this with grep on a chunk of the file, but on the original data (200M lines of keys and a 2 TB source file) grep fails with "memory exhausted".
grep --no-group-separator -A 2 -B 1 -f key source
Sample key file:
^CNACCCAAGGCTCATT
^ANAGCGGCAACTCGCG
I added the "^" to each line because the key is the first 16 characters of the line that follows the one starting with '@'.
Each pattern is a random string of 16 characters drawn from ATGCN. A pattern may have multiple matches in the source file.
Sample source file to search:
@A00354:427:HVYWLDSXY:1:1101:1036:1000 1:N:0:ATTACTTC
CNACCCAAGGCTCATTCATTATATAGTGGAGGCGGAGAACTTTCCTCCGGTTTGCCTAACATGCCAGCTGTCGGTGTCAAAACCGGCGGATCTCGGGAAGGGGGTCCTGAACTGTGCGTCTTAGGTCGATGGTAATAGGAGACGGGGGAC
+
:#:FFFFFF:F,FFFFFFF:FFF,FF:FFFFFF,FFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,F:FFFFFFFFFFFFFF:F:F,:F:FFFFFFFFFFF:FFF
@A00354:427:HVYWLDSXY:1:1101:1108:1000 1:N:0:ATTACTTC
ANAGCGGCAACTCGCGGTTCCCCTACACATAGAAAACCTACGCCACATTATTGGCTAGGACGAGTGGTTCGTCTGCGTACGCAAGATTGTTGAGATCCACTATTGTCATTCAGTACTACGGTTCTTCTTATCTTGGTCGATCGTGTAAAA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF
@A00354:427:HVYWLDSXY:1:1101:1271:1000 1:N:0:ATTACTTC
CNATCCCGTCTCGAGCCCGCCCCAATAGCAACAACAACAACAACAACAACAACAACAGCAACAACACCAGCAACACCAGCAACAACAGCAACAACAACAACAGCAACAACAACAACAACAACAACAACAACAACAACAACAACAACAAGA
+
F#FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00354:427:HVYWLDSXY:1:1101:1325:1000 1:N:0:ATTACTTC
TNCGGTTCATAGGAATGTAGTCTTTGTAATTATGCGCAATTTCCAAACACTTCAAGGTTTTTTTGCAAATAAAACATTCAGGCCTCGTGTGTGCCGCTGCATCTTAGATCCAACGGCTCCTAGTTGCTCATATTCNACCCAAGGCTCATTAGGTGCTCCCCGTAGC
+
:#FFF:F,FFFFFFFFFFFF,:FFF::F,FFF,F:FFFFFFF:FFFF:FF:F:FFF:F:F:FFFFFFFF,FF,F:FF:FF::F,FFF:FFFFFF,:F::FFFFFFF:FF:FFFFF,FFFFFF,FFF:FFFFFFFFF,FFFF:FFFFFFF:
Even if I split the key file, it is painfully slow.
Can this be done more efficiently with a Perl one-liner or awk?
The expected output would be:
@A00354:427:HVYWLDSXY:1:1101:1036:1000 1:N:0:ATTACTTC
CNACCCAAGGCTCATTCATTATATAGTGGAGGCGGAGAACTTTCCTCCGGTTTGCCTAACATGCCAGCTGTCGGTGTCAAAACCGGCGGATCTCGGGAAGGGGGTCCTGAACTGTGCGTCTTAGGTCGATGGTAATAGGAGACGGGGGAC
+
:#:FFFFFF:F,FFFFFFF:FFF,FF:FFFFFF,FFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,F:FFFFFFFFFFFFFF:F:F,:F:FFFFFFFFFFF:FFF
@A00354:427:HVYWLDSXY:1:1101:1108:1000 1:N:0:ATTACTTC
ANAGCGGCAACTCGCGGTTCCCCTACACATAGAAAACCTACGCCACATTATTGGCTAGGACGAGTGGTTCGTCTGCGTACGCAAGATTGTTGAGATCCACTATTGTCATTCAGTACTACGGTTCTTCTTATCTTGGTCGATCGTGTAAAA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF
I saw code like
awk 'NR==FNR{a[$1]; next} {for (i in a) if (index($0, i)) print $1}' key source
which checks whether each entry in key is a substring of the source line, but I couldn't work out how to make it check for an anchored pattern (^CNACCCAAGGCTCATT) and fetch the previous and next lines.
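Because every key is anchored at the start of the sequence line and is exactly 16 characters long, the per-key loop above could in principle be replaced by a single array lookup on the first 16 characters of each sequence line. The following is only a minimal sketch of that idea, not a tested solution for the full data set. It assumes the source is strictly 4 lines per record ('@' header, sequence, '+', qualities) and that every key line is "^" followed by the 16 characters. The function name extract_records is made up for illustration, and whether 200M keys fit in an awk array depends on the machine's RAM and the awk implementation.

```shell
# Hypothetical helper (name is illustrative): extract_records KEYFILE SOURCEFILE
# Assumes strict 4-line records and keys written as "^" + 16 characters.
extract_records() {
    awk '
        # First file: strip the leading "^" and store the key as an array index.
        NR == FNR { sub(/^\^/, ""); keys[$0]; next }

        # Second file, line 1 of each record: remember the "@" header.
        FNR % 4 == 1 { header = $0; next }

        # Line 2 is the sequence: one lookup on its first 16 characters.
        FNR % 4 == 2 {
            hit = (substr($0, 1, 16) in keys)
            if (hit) { print header; print }
            next
        }

        # Lines 3 and 4 ("+" and qualities) are printed only after a hit.
        hit
    ' "$1" "$2"
}
```

It would be called as extract_records key source > output. The lookup compares only the start of the sequence line, so a record that contains a key further along the line (like the last record in the sample) is not reported.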
Another approach I tried but couldn't get working was: zcat key | match each line against source file > output
*Maybe the slowness is due to my code; any alternative is much appreciated.