Using Python regular expressions to split on non-overlapping character groups (ORF finding)

Question

I am working on finding a pretty and pythonic way to find open reading frames in a DNA sequence. I have found many implementations online that make use of indexing, flags and other such ugliness.

I am pretty sure that a regular expression implementation can be created, but I am bad with regex. The general idea is that I want to split a string of DNA sequence by 'ATG', 'TAG', 'TGA' and 'TAA'. But I do not want to split on overlapping regions, for example the sequence 'ATGA' should be split into 'ATG','A'. Basically go from left to right in one of each of the three frames.

edit for clarity: As said in the comments, a sequence such as ATGATTTTGA should be split into ATG, TTT, TGA despite the presence of TGA (which is in the non-zero frame)

edit2: this is how I have implemented it without regular expressions using the list comprehension splitting linked. I hate the use of flags though.

def find_orf(seq):
    length = 0
    stop = ['TAA','TGA','TAG']
    for frame in range(3):
        orfFlag, thisLen = None, 0
        splitSeq = [seq[start+frame:start+frame+3] for start in range(0,len(seq),3)]
        for codon in splitSeq:
            if codon == 'ATG':
                orfFlag = True
                thisLen += 1
            elif orfFlag and codon in stop:
                orfFlag = None
                if thisLen > length:
                    length = thisLen
            else:
                thisLen += 1
    return length

"not want to split on overlapping regions, for example the sequence 'ATGA' should be split into 'ATG','A'. " The two parts of this sentence ("not want to split" and "should be split") seem to conflict with each other. Can you clarify or give some more examples to better describe your needs? — KQ., Nov 04 '13 at 05:08
Sorry, here is an example. 'ATGAAATAA' should be split into 'ATG', 'AAA', 'TAA' despite the presence of 'TGA' in the sequence (in a different fame) — Ian Fiddes, Nov 04 '13 at 05:09
The amino acids + stop/start codons? 64 possible triplets... Why don't you just split it to every 3 characters? — Joe, Nov 04 '13 at 05:17
I am trying to find the length of open reading frames in a sequence. The actual content of the frame is not important. So my approach was to design a regular expression to split the frame based on the start and 3 stop codons then look for the longest string in the resulting list, then add 3 to that before returning. I guess I could split to every 3 and look for start/stop iteratively but it doesn't seem as pretty. — Ian Fiddes, Nov 04 '13 at 05:21
Have you looked into Biopython? http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc294 Up to 6 reading frames, right? Since you're trying to find the lengths of the different ones, you'd need to iterate through them. You don't care what's in it as long as a stop codon isn't in there, so splitting to every 3 AS LONG AS stop codons aren't in there would be the best way to go, imo. Now, if say you just need the length of the longest one, then do a regex search for stop codons, only if NOT found proceed with splitting and then finding length. — Joe, Nov 04 '13 at 05:31
If I could use biopython this would be easy, but unfortunately I can't have it installed on the servers I am working on. — Ian Fiddes, Nov 04 '13 at 05:34
Well, either way you're going to need to iterate through the up to 6 frames. I suggest splitting the 6 frames into arrays containing the triplets, and then seeing if any of the 6 contain the stop codons. Eliminate the ones that do, if any, and then you can compare the length of the arrays to find the longest one. — Joe, Nov 04 '13 at 05:37
I have become convinced that regex is the wrong tool for finding ORFs (yes, even wore than for parsing HTML ;^). I have been using an implementation that translates all 6 frames first (using a dictionary), then splits on stop codons, and looks for the longest ORF, or the one with fewest fragments. If you expect a start codon, you can split on the first M and then by stop codons. — beroe, Nov 04 '13 at 05:44

score 2 · Answer 1 · answered Nov 04 '13 at 05:21

2

I am not sure your suggested regex method is particularly pythonic, but the basic regex:

import re
v=re.compile("((ATG)|(TGA)|(TAG)|(TAA))")
test="CCATGACCCATGCACCATTGAC"
for i in v.findall(test):
   print i

Does miss the first TGA that is part of ATGA and only reports the second. In general though this won't work as you will have to assume the frame of your gene, which likely is not known ahead of time.

A very readable way would be to simply do list comprehensions over all three reading frames.

answered Nov 04 '13 at 05:21

evolvedmicrobe

2,672
2
22
30

The idea would be to check all 3 frames in a loop, since I am only looking to determine the longest ORF in a sequence. I am looking at a list comprehension method such as http://stackoverflow.com/questions/9475241/split-python-string-every-nth-character for splitting into 3-mers but I am unsure how I would actually find the ORFs without using a flag – Ian Fiddes Nov 04 '13 at 05:32

score 0 · Answer 2 · edited May 23 '17 at 11:57

0

I'd recommend a generator. Very pretty and easy to understand/maintain without involving a regex, which is the wrong tool for the job for just splitting strings into chunks:

def chunks(s, n):
    """Produce `n`-character chunks from `s`."""
    for start in range(0, len(s), n):
        yield s[start:start+n]

chars = "ATGAAATAA"
for chunk in chunks(chars, 3):
    print chunk

Output:

ATG
AAA
TAA

Try yourself here: http://ideone.com/4yQw4y

Full credit for the implementation of the algorithm goes to the answer here: Split string by count of characters

edited May 23 '17 at 11:57

Community

1
1

answered Nov 04 '13 at 05:21

sdanzig

4,510
1
23
27

http://stackoverflow.com/questions/7111068/split-string-by-count-of-characters http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python You could at least give credit to the original answer you just copied straight from. – Joe Nov 04 '13 at 05:23
Wasn't trying to pretend it was me. Just giving him an answer. Added credit as suggested. – sdanzig Nov 04 '13 at 05:25

score 0 · Answer 3 · answered Nov 04 '13 at 05:28

Still not sure if you are wanting just all sequences of 3 letters or if you want only specific three letter sequences. However, a regex match of a sequence of characters will "consume" those characters and not overlap matched characters with subsequent regex searching, so:

If you want the first:

r = re.compile('[ATG]{3}')
r.findall('ATGAAATAA')

If you want the second:

r = re.compile('(ATG|TAG|TGA|TAA|AAA)')
r.findall('ATGAAATAA')

Both return: ['ATG', 'AAA', 'TAA']

I did take the liberty of adding AAA to the match sequence in the second.

Using Python regular expressions to split on non-overlapping character groups (ORF finding)

3 Answers3