I am working on finding a pretty and pythonic way to find open reading frames in a DNA sequence. I have found many implementations online that make use of indexing, flags and other such ugliness.
I am pretty sure that a regular expression implementation can be created, but I am bad with regex. The general idea is that I want to split a string of DNA sequence by 'ATG', 'TAG', 'TGA' and 'TAA'. But I do not want to split on overlapping regions, for example the sequence 'ATGA' should be split into 'ATG','A'. Basically go from left to right in one of each of the three frames.
edit for clarity: As said in the comments, a sequence such as ATGATTTTGA
should be split into ATG
, TTT
, TGA
despite the presence of TGA
(which is in the non-zero frame)
edit2: this is how I have implemented it without regular expressions using the list comprehension splitting linked. I hate the use of flags though.
def find_orf(seq):
length = 0
stop = ['TAA','TGA','TAG']
for frame in range(3):
orfFlag, thisLen = None, 0
splitSeq = [seq[start+frame:start+frame+3] for start in range(0,len(seq),3)]
for codon in splitSeq:
if codon == 'ATG':
orfFlag = True
thisLen += 1
elif orfFlag and codon in stop:
orfFlag = None
if thisLen > length:
length = thisLen
else:
thisLen += 1
return length