Find positions of ALL occurrences of a pattern in a string

Question

I am working with DNA sequence files (FASTQ format).

@Read1- GOOD
NAAAGTGAGATTCGAAATAAATACATCTGTGGCTTCACTTTGAACGGAACGATGTTCTCGTAT
+
1D=DDADEHHHHHIGIJJJJGGFGHIHIJJIJJJJJIIIIGG99BDGHHHEGHJJIHHJJGIH
@Read2- Has 2 bad place
NTTCGTAAAGCAGTGAACGAAATACATCTGTGGCTTCACTATGTTCTCGTATGCCGGAACGTC
+
2#1=DFFFFHHHGHGHIJHJIJJJJJJJJJJJJJJJJJGIIHJJJJIIIGGHIIJJIHIIIIJG
@Read3 : one good, one early
NCAGGATCTGCATCGTGAACGATACATCTGTGGCTTCACTAGAACGTGTTCTCGTATGCCGTC
+
B#1:BDDDDFFHDH@AHIGCHIIIIIIIIIIIIIIIIIIIIGIIFHBGGGFGIIIIGGHIIIIG
@Read4 : one good, one after
NGCCCTTGACCGCAGGTTAGTGCTAAATACATCTGTGTACTGAACGTCACTATGTTCTCGTAT
+
E#1:A?==@@B>AC<7,2A@ABBBBCBCBCCBCCBBBBBBBB<<?AA?AA)8>ABBAAABABBA

I want to look for a 6-character pattern (GAACG) within each sequence (the line below each line starting with @).

The important thing is that I want my pattern to be found at position 42 within the string.

If the pattern is found at that position, I copy the sequence, together with the line before it and the 2 lines following it, into a new file. When I tried this with awk, it didn't work, because the index() and match() functions only report the first occurrence and don't look any further. So if the pattern also occurred before position 42, my data was not copied to the new file.

Basically my script should return reads 1, 3 and 4...

How can I screen my FASTQ file for a pattern, evaluate ALL the positions where it is found, and consider only the sequences that have it at position 42, no matter if the pattern is ALSO present at other positions?

score 0 · Answer 1 · edited May 23 '17 at 11:50

Sounds like a regex problem.

Many programming and scripting languages support regex, but this appears to be a good example in JavaScript: "How to find all occurrences of one string in another in JavaScript".
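For reference, here is a minimal sketch of the regex idea in Python (picked only for illustration; the question itself uses awk). A zero-width lookahead reports every start position, including overlapping ones. The function name `all_positions` is hypothetical, positions are 1-based to match the question, and the two sequences are Read1 and Read2 from the question.

```python
import re

def all_positions(pattern, seq):
    """Return 1-based start positions of every occurrence of pattern in seq,
    including overlapping ones. (Hypothetical helper, not from the thread.)"""
    # A lookahead matches without consuming characters, so the scan
    # advances one character at a time and no occurrence is skipped.
    finder = re.compile("(?=" + re.escape(pattern) + ")")
    return [m.start() + 1 for m in finder.finditer(seq)]

read1 = "NAAAGTGAGATTCGAAATAAATACATCTGTGGCTTCACTTTGAACGGAACGATGTTCTCGTAT"
read2 = "NTTCGTAAAGCAGTGAACGAAATACATCTGTGGCTTCACTATGTTCTCGTATGCCGGAACGTC"

for name, seq in (("Read1", read1), ("Read2", read2)):
    hits = all_positions("GAACG", seq)
    # Keep a read only if one of its hits is exactly at position 42.
    print(name, hits, 42 in hits)
```

On these two reads the sketch reports positions 42 and 47 for Read1 and positions 15 and 57 for Read2, so only Read1 passes the position-42 test.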

answered Apr 17 '12 at 20:26

Tim

Thanks for your reply. I'm working in a Linux environment and I have to read through a sequence file that has millions of sequences. Also, the sequences appear only every 4 rows, starting from the second row (as shown in my original message). So I don't think JavaScript can really be applied in my case, unfortunately... But thanks a lot! – user1339677 Apr 17 '12 at 20:44
Actually it was simple... substr(seq,42,6)==pattern should be true... that's it! – user1339677 Apr 18 '12 at 15:10
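
The fixed-position check from the comment above can be sketched as a complete filter. This is a Python sketch rather than awk, and it rests on a few assumptions: every record is exactly four lines, the function name `filter_fastq` and the file names in the usage comment are hypothetical, and the slice length is taken from the pattern itself, because GAACG as written in the question is 5 characters rather than 6.

```python
from itertools import islice

def filter_fastq(lines, pattern, position=42):
    """Yield 4-line FASTQ records whose sequence has `pattern` starting at
    the given 1-based position. `lines` is any iterable of text lines."""
    it = iter(lines)
    start = position - 1  # Python slices are 0-based
    while True:
        record = list(islice(it, 4))  # header, sequence, '+', qualities
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        seq = record[1].rstrip("\n")
        # Same idea as awk's substr(seq, 42, length(pattern)) == pattern;
        # occurrences elsewhere in the sequence are simply ignored.
        if seq[start:start + len(pattern)] == pattern:
            yield record

# Usage sketch (file names are placeholders):
# with open("reads.fastq") as src, open("filtered.fastq", "w") as dst:
#     for record in filter_fastq(src, "GAACG"):
#         dst.writelines(record)
```

Because the file is read four lines at a time and only one slice is compared per record, the sketch never needs to enumerate all occurrences, which matters for a file with millions of sequences.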
