I am working with DNA sequence files in FASTQ format. Here is a sample:
@Read1- GOOD
NAAAGTGAGATTCGAAATAAATACATCTGTGGCTTCACTTTGAACGGAACGATGTTCTCGTAT
+
1D=DDADEHHHHHIGIJJJJGGFGHIHIJJIJJJJJIIIIGG99BDGHHHEGHJJIHHJJGIH
@Read2- Has 2 bad place
NTTCGTAAAGCAGTGAACGAAATACATCTGTGGCTTCACTATGTTCTCGTATGCCGGAACGTC
+
2#1=DFFFFHHHGHGHIJHJIJJJJJJJJJJJJJJJJJGIIHJJJJIIIGGHIIJJIHIIIIJG
@Read3 : one good, one early
NCAGGATCTGCATCGTGAACGATACATCTGTGGCTTCACTAGAACGTGTTCTCGTATGCCGTC
+
B#1:BDDDDFFHDH@AHIGCHIIIIIIIIIIIIIIIIIIIIGIIFHBGGGFGIIIIGGHIIIIG
@Read4 : one good, one after
NGCCCTTGACCGCAGGTTAGTGCTAAATACATCTGTGTACTGAACGTCACTATGTTCTCGTAT
+
E#1:A?==@@B>AC<7,2A@ABBBBCBCBCCBCCBBBBBBBB<<?AA?AA)8>ABBAAABABBA
I want to look for a 5-character pattern (GAACG) within each sequence (the line directly below each line starting with @).
The important thing is that the pattern must start at position 42 of the sequence, counting from 1.
If the pattern is found at that position, I want to copy the sequence, together with the line before it and the 2 lines following it, into a new file. When I tried this with awk it didn't work, because the index() and match() functions only report the first occurrence and don't look any further. So if my pattern also occurs before position 42, the check fails and the record is not copied to the new file.
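To make the first-occurrence problem concrete, here is a minimal sketch in Python rather than awk. The helper names all_positions and has_pattern_at are made up for illustration. It lists every 1-based start position of the pattern in the Read3 sequence, then shows that testing position 42 directly avoids the search altogether.

```python
def all_positions(seq, pattern):
    """Return every 1-based start position of pattern in seq (overlaps included)."""
    positions = []
    start = seq.find(pattern)          # 0-based index of the first hit, or -1
    while start != -1:
        positions.append(start + 1)    # convert to 1-based
        start = seq.find(pattern, start + 1)
    return positions


def has_pattern_at(seq, pattern, pos):
    """True if pattern starts exactly at 1-based position pos in seq."""
    return seq.startswith(pattern, pos - 1)


# Sequence of Read3 from the sample above
read3 = "NCAGGATCTGCATCGTGAACGATACATCTGTGGCTTCACTAGAACGTGTTCTCGTATGCCGTC"

print(all_positions(read3, "GAACG"))       # [17, 42]
print(has_pattern_at(read3, "GAACG", 42))  # True
```

A first-occurrence search stops at position 17 for this read, which is why a check of the form "first match == 42" rejects it even though the pattern is also present at 42.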
Basically, my script should return reads 1, 3 and 4.
How can I screen my FASTQ file for a pattern, evaluate ALL the positions where it is found, and keep only the sequences that have it at position 42, regardless of whether the pattern is ALSO present at other positions?
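For reference, the behaviour I am after can be sketched in Python (again, not awk). This is only a rough outline under two assumptions: every record is exactly 4 lines (no wrapped sequences), and positions are counted from 1. The function names filter_fastq and write_matches, and the file names in the usage comment, are hypothetical.

```python
def filter_fastq(lines, pattern="GAACG", pos=42):
    """Yield 4-line FASTQ records whose sequence has pattern at 1-based pos.

    Assumes strict 4-line records: header, sequence, '+', quality.
    Other occurrences of the pattern in the sequence are ignored.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # record[1] is the sequence line; test the fixed position only
            if record[1].startswith(pattern, pos - 1):
                yield record
            record = []


def write_matches(in_path, out_path, pattern="GAACG", pos=42):
    """Copy matching records from in_path to out_path."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for rec in filter_fastq(fin, pattern, pos):
            fout.write("\n".join(rec) + "\n")


# Example call (file names are placeholders):
# write_matches("reads.fastq", "reads_pos42.fastq")
```

Because the check looks only at characters 42 to 46 of each sequence, an earlier or later copy of the pattern does not affect the result.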