Return only the longest matches in re.finditer when identifying an Open Reading Frame

Question

I am trying to write code that will identify open reading frames in a DNA sequence. An ORF is defined as a portion of a sequence that starts with ATG and ends with a stop codon TAG, TAA, or TGA. I have used a look ahead expression to find overlapping sequences. However, I want only the longest strings to be printed.

(?=(ATG(?:[ATGC]{3}){%d,}?(?:TAG|TAA|TGA)))' % (aa)

I've tried something similar using a different approach (translate six frames first, look for longest M...* fragments). Is your sequence a transcript that will likely have one ORF, or a portion of genome that will have a bunch of genes? In the latter case, you are going to be really sensitive to frame shifts that will likely produce chimeric outputs... — beroe, Sep 10 '13 at 00:42
Here is a related (identical?) question... http://stackoverflow.com/questions/13114246/how-to-find-a-open-reading-frame-in-python — beroe, Sep 10 '13 at 00:47
@beroe I am looking for every possible ORF within a sequence. It could be split up into 3 'reading frames' though given that a codon is made with 3 nucleotides. — , Sep 10 '13 at 02:25
Seems like if you have two ORFs with UTR between `ATG...TAA...ATG...TGA` you'll just pull back the longest total ORF spanning those regions, or if you constrain to shortest match, you'll go from any internal M to the stop codon. — beroe, Sep 10 '13 at 02:32
@beroe that is fine for the purposes of what I'm doing. The biggest problem is that the code is pulling out shorter regions that aren't actually an open ORF. e.g.: ATGAAAATG...TAA will pull out that whole sequence plus ATG...TAA — , Sep 10 '13 at 02:35
Just a "normal" regexp seems to get me the absolute longest match: `(ATG(?:[ATGC]{3}){1,}(?:TAG|TAA|TGA))` (remove initial `?=` and the `?` after `{}`) http://tinyurl.com/pmgcyln — beroe, Sep 11 '13 at 01:01

nhahtdh · Answer 1 · 2013-09-09T23:57:15.783

1

Non-solution

Just remove the look-ahead. The match will consume the text, and disallow the matched text from being matched again (which gives extra unwanted results).

'(ATG(?:[ATGC]{3}){%d,}?(?:TAG|TAA|TGA))' % (aa)

I assume your requirement is to find all sequences, except those that ends at the same index but is shorter than an existing sequence.

Solution 1: Building on top of current solution

Note that your current regex will still allow invalid sequence to be matched when ATG is too close to the end codon. You still need to use a negative look-ahead to prevent invalid sequences. Then there is no longer a need for the lazy quantifier.

'(?=(ATG(?:(?!TAG|TAA|TGA)[ATGC]{3}){%d,}(?:TAG|TAA|TGA)))' % (aa)

You can then post-process all the matches and filter out unwanted matches. You should record all the matches with the corresponding start and end indices. Sort the matches by the end index, and for each of the end index, keep only the match with the smallest start index.

Solution 2: Reverse the string and use regex

It is possible to do so by first reversing the sequence, and iterate through the matches of the following regex:

'(?=((?:GAT|AAT|AGT)(?:(?!GAT|AAT|AGT)[ATGC]{3}){%d,}GTA))' % (aa)

The regex uses negative look-ahead to ensure that there is no end-codons inside the sequence, and the quantifier is made greedy to get the longest instance.

The effect is not replicable with the normal order of the sequence. Since you requires the indices of the ending codons to be unique, I make use of the fact that there can only be one match per index to enforce that condition. There is no way to enforce unique ending position with the level of support in re module.

You don't need to reverse the string if you use regex module. You only need to set the REVERSE flag to enable reverse searching with the same regex as above (not tested).

edited Sep 09 '13 at 23:57

answered Sep 09 '13 at 22:47

nhahtdh

55,989
15
126
162

You probably want to drop the `?` from `{%d,}?`, too. – Alan Moore Sep 09 '13 at 22:58
There can still be ORFs that are in a different frame and thus could overlap. For example: ATGAATGAAAAAATAAAATAA – Sep 09 '13 at 23:10
Okay, I get it. What I *don't* get is why you were using a lookahead in the first place. – Alan Moore Sep 09 '13 at 23:12
1

@AlanMoore: He just explained with an example. – nhahtdh Sep 09 '13 at 23:13
@gr8skillz: If you want the longest match (only 1), then just use your original regex, and pick out the longest one out of all the matches. – nhahtdh Sep 09 '13 at 23:19
@nhahtdh I guess that's what my original question is. How do I pick out only the longest? There are more than one Open Reading Frames throughout a genome, so I expect more than one result. – Sep 09 '13 at 23:21
@nhahtdh how could I use post-processing? Check for identical reads within the same frame? – Sep 09 '13 at 23:47
@gr8skillz: I edit again to include more details. The post-processing step in solution 1 is just a conceptual solution. You might try a solution with high-order function. – nhahtdh Sep 10 '13 at 00:00
@nhahtdh I tried using approach similar to 1 but now I am getting an ORF length of 0. Do you know why that would be? – Sep 10 '13 at 02:52
@gr8skillz: Compare the end index, not the end codons. – nhahtdh Sep 10 '13 at 03:24
@nhahtdh I still get 0 – Sep 10 '13 at 03:26
@gr8skillz: It is just some basic programming error. Please fix it yourself. – nhahtdh Sep 10 '13 at 03:35

Return only the longest matches in re.finditer when identifying an Open Reading Frame

1 Answers1

Non-solution

Solution 1: Building on top of current solution

Solution 2: Reverse the string and use regex