13

I am using Python and a regular expression to find an ORF (open reading frame).

Find a sub-string a string that is composed ONLY of the letters ATGC (no spaces or new lines) that:

Starts with ATG, ends with TAG or TAA or TGA and should consider the sequence from the first character, then second and then third:

Seq= "CCTCAGCGAGGACAGCAAGGGACTAGCCAGGAGGGAGAACAGAAACTCCAGAACATCTTGGAAATAGCTCCCAGAAAAGC
AAGCAGCCAACCAGGCAGGTTCTGTCCCTTTCACTCACTGGCCCAAGGCGCCACATCTCCCTCCAGAAAAGACACCATGA
GCACAGAAAGCATGATCCGCGACGTGGAACTGGCAGAAGAGGCACTCCCCCAAAAGATGGGGGGCTTCCAGAACTCCAGG
CGGTGCCTATGTCTCAGCCTCTTCTCATTCCTGCTTGTGGCAGGGGCCACCACGCTCTTCTGTCTACTGAACTTCGGGGT
GATCGGTCCCCAAAGGGATGAGAAGTTCCCAAATGGCCTCCCTCTCATCAGTTCTATGGCCCAGACCCTCACACTCAGAT
CATCTTCTCAAAATTCGAGTGACAAGCCTGTAGCCCACGTCGTAGCAAACCACCAAGTGGAGGAGCAGCTGGAGTGGCTG
AGCCAGCGCGCCAACGCCCTCCTGGCCAACGGCATGGATCTCAAAGACAACCAACTAGTGGTGCCAGCCGATGGGTTGTA
CCTTGTCTACTCCCAGGTTCTCTTCAAGGGACAAGGCTGCCCCGACTACGTGCTCCTCACCCACACCGTCAGCCGATTTG
CTATCTCATACCAGGAGAAAGTCAACCTCCTCTCTGCCGTCAAGAGCCCCTGCCCCAAGGACACCCCTGAGGGGGCTGAG
CTCAAACCCTGGTATGAGCCCATATACCTGGGAGGAGTCTTCCAGCTGGAGAAGGGGGACCAACTCAGCGCTGAGGTCAA
TCTGCCCAAGTACTTAGACTTTGCGGAGTCCGGGCAGGTCTACTTTGGAGTCATTGCTCTGTGAAGGGAATGGGTGTTCA
TCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTA
TCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAA
GATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGG
AGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT
CTCAGGCCTTCCTACCTTCAGACCTTTCCAGATTCTTCCCTGAGGTGCAATGCACAGCCTTCCTCACAGAGCCAGCCCCC
CTCTATTTATATTTGCACTTATTATTTATTATTTATTTATTATTTATTTATTTGCTTATGAATGTATTTATTTGGAAGGC
CGGGGTGTCCTGGAGGACCCAGTGTGGGAAGCTGTCTTCAGACAGACATGTTTTCTGTGAAAACGGAGCTGAGCTGTCCC
CACCTGGCCTCTCTACCTTGTTGCCTCCTCTTTTGCTTATGTTTAAAACAAAATATTTATCTAACCCAATTGTCTTAATA
ACGCTGATTTGGTGACCAGGCTGTCGCTACATCACTGAACCTCTGCTCCCCACGGGAGCCGTGACTGTAATCGCCCTACG
GGTCATTGAGAGAAATAA"

What I have tried:

# finding the  stop codon here 

def stop_codon(seq_0):

        for i in range(0,len(seq_0),3):
            if (seq_0[i:i+3]== "TAA" and i%3==0) or (seq_0[i:i+3]== "TAG" and i%3==0) or (seq_0[i:i+3]== "TGA" and i%3==0) :
                a =i+3

                break

            else:
                a = None

# finding the start codon here 

startcodon_find =[m.start() for m in re.finditer('ATG', seq_0)]

How can I find a way to check the start codon and then find the first stop codon. Subsequently find the next start codon and the next stop codon.

I wish to run this for three frames. As mentioned earlier the three frames would be considering the first, second and third characters of the sequence as the start.

Also the sequence needs to be divided into small parts of 3. There for it should be some thing like this:

ATG TTT AAA ACA AAA TAT TTA TCT AAC CCA ATT GTC TTA ATA ACG CTG ATT TGA

Any help will be appreciated.

My final answer :

def orf_find(st0):

    seq_0=""
    for i in range(0,len(st0),3):
        if len(st0[i:i+3])==3:
            seq_0 = seq_0 + st0[i:i+3]+ " "

    ms_1 =[m.start() for m in re.finditer('ATG', seq_0)]
    ms_2 =[m.start() for m in re.finditer('(TAA)|(TAG)|(TGA)', seq_0)]

    def get_next(arr,value):
        for a in arr:
            if a > value:
                return a
        return -1




    codons = []
    start_codon=ms_1[0]
    while (True):
        stop_codon = get_next(ms_2,start_codon)
        if stop_codon == -1:
            break
        codons.append((start_codon,stop_codon))
        start_codon = get_next(ms_1,stop_codon)
        if start_codon==-1:
            break

    max_val = 0
    selected_tupple = ()
    for i in codons:
        k=i[1]-i[0]
        if k > max_val:
            max_val = k
            selected_tupple = i

    print "selected tupple is ", selected_tupple

    final_seq=seq_0[selected_tupple[0]:selected_tupple[1]+3]

    print final_seq
    print "The longest orf length is " + str(max_val)



output_file = open('Longorf.txt','w')
output_file.write(str(orf_find(st0)))

output_file.close()

The above write function does not help me in writing the content on to a text file . All i get in there is NONE.. Why this error .. Can anybody Help ?

Nodnin
  • 451
  • 2
  • 9
  • 21

3 Answers3

11

If you want to hand-code it:

import re
from string import maketrans

pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')

def revcomp(dna_seq):
    return dna_seq[::-1].translate(maketrans("ATGC","TACG"))

def orfs(dna):
    return set(pattern.findall(dna) + pattern.findall(revcomp(dna)))

print orfs(Seq)
warship
  • 2,924
  • 6
  • 39
  • 65
Stefan Gruenwald
  • 2,582
  • 24
  • 30
9

As you have tagged it Biopython I suppose you know of Biopython. Have you checked out the docu yet? http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc231 might help.

I adjusted the code from the above link a bit to work on your sequence:

from Bio.Seq import Seq

seq = Seq("CCTCAGCGAGGACAGCAAGGGACTAGCCAGGAGGGAGAACAGAAACTCCAGAACATCTTGGAAATAGCTCCCAGAAAAGCAAGCAGCCAACCAGGCAGGTTCTGTCCCTTTCACTCACTGGCCCAAGGCGCCACATCTCCCTCCAGAAAAGACACCATGAGCACAGAAAGCATGATCCGCGACGTGGAACTGGCAGAAGAGGCACTCCCCCAAAAGATGGGGGGCTTCCAGAACTCCAGGCGGTGCCTATGTCTCAGCCTCTTCTCATTCCTGCTTGTGGCAGGGGCCACCACGCTCTTCTGTCTACTGAACTTCGGGGTGATCGGTCCCCAAAGGGATGAGAAGTTCCCAAATGGCCTCCCTCTCATCAGTTCTATGGCCCAGACCCTCACACTCAGATCATCTTCTCAAAATTCGAGTGACAAGCCTGTAGCCCACGTCGTAGCAAACCACCAAGTGGAGGAGCAGCTGGAGTGGCTGAGCCAGCGCGCCAACGCCCTCCTGGCCAACGGCATGGATCTCAAAGACAACCAACTAGTGGTGCCAGCCGATGGGTTGTACCTTGTCTACTCCCAGGTTCTCTTCAAGGGACAAGGCTGCCCCGACTACGTGCTCCTCACCCACACCGTCAGCCGATTTGCTATCTCATACCAGGAGAAAGTCAACCTCCTCTCTGCCGTCAAGAGCCCCTGCCCCAAGGACACCCCTGAGGGGGCTGAGCTCAAACCCTGGTATGAGCCCATATACCTGGGAGGAGTCTTCCAGCTGGAGAAGGGGGACCAACTCAGCGCTGAGGTCAATCTGCCCAAGTACTTAGACTTTGCGGAGTCCGGGCAGGTCTACTTTGGAGTCATTGCTCTGTGAAGGGAATGGGTGTTCATCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTATCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAAGATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGGAGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGATCTCAGGCCTTCCTACCTTCAGACCTTTCCAGATTCTTCCCTGAGGTGCAATGCACAGCCTTCCTCACAGAGCCAGCCCCCCTCTATTTATATTTGCACTTATTATTTATTATTTATTTATTATTTATTTATTTGCTTATGAATGTATTTATTTGGAAGGCCGGGGTGTCCTGGAGGACCCAGTGTGGGAAGCTGTCTTCAGACAGACATGTTTTCTGTGAAAACGGAGCTGAGCTGTCCCCACCTGGCCTCTCTACCTTGTTGCCTCCTCTTTTGCTTATGTTTAAAACAAAATATTTATCTAACCCAATTGTCTTAATAACGCTGATTTGGTGACCAGGCTGTCGCTACATCACTGAACCTCTGCTCCCCACGGGAGCCGTGACTGTAATCGCCCTACGGGTCATTGAGAGAAATAA")


table = 1
min_pro_len = 100

for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
    for frame in range(3):
        for pro in nuc[frame:].translate(table).split("*"):
            if len(pro) >= min_pro_len:
                print "%s...%s - length %i, strand %i, frame %i" % (pro[:30], pro[-3:], len(pro), strand, frame)

The ORF is also translated. You can choose a different translation table. Check out http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:translation

EDIT: Explanation of the code:

Right at the top I create a sequence object out of your string. Notice the seq = Seq("ACGT"). The two for-loops create the 6 different frames. The inner for-loop translates each frame according to the chosen translation table and returns an amino acid chain where each stop codon is coded as *. The split function splits this string removing these placeholders resulting in a list of possible protein sequences. By setting min_pro_len you can define the minimum amino acid chain length for a protein to be detected. 1 is the standard table. Check out http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG1 Here you see that the initiation codon is AUG (equals ATG) and the end codons (* in the nucleotide sequence) are TAA, TAG, and TGA, just like you wanted. You could also use a different translation table.

When you add

print nuc[frame:].translate(table)

right inside the second for-loop you get something like:

PQRGQQGTSQEGEQKLQNILEIAPRKASSQPGRFCPFHSLAQGATSPSRKDTMSTESMIRDVELAEEALPQKMGGFQNSRRCLCLSLFSFLLVAGATTLFCLLNFGVIGPQRDEKFPNGLPLISSMAQTLTLRSSSQNSSDKPVAHVVANHQVEEQLEWLSQRANALLANGMDLKDNQLVVPADGLYLVYSQVLFKGQGCPDYVLLTHTVSRFAISYQEKVNLLSAVKSPCPKDTPEGAELKPWYEPIYLGGVFQLEKGDQLSAEVNLPKYLDFAESGQVYFGVIAL*REWVFIHSLPSPHSDPFTLTPLLSTPQSPQSVSF*LRKGIMAQGPTLCSELSTTTQKHKMLGQ*PGLWASHAPPSRTQMGFPNSLEPRMSIPEFCKGRVVRLPLSQNEAG*DLRPSYLQTFPDSSLRCNAQPSSQSQPPSIYICTYYLLFIYYLFICL*MYLFGRPGCPGGPSVGSCLQTDMFSVKTELSCPHLASLPCCLLFCLCLKQNIYLTQLS**R*FGDQAVATSLNLCSPREP*L*SPYGSLREI

(notice the asterisks are at the stop codon positions)

EDIT: Answer to your second question:

You must return a string that you want write into a file. Create an Output string and return it at the end of the function:

output = "selected tupple is " + str(selected_tupple) + "\n"
output += final_seq + "\n"
output += "The longest orf length is " + str(max_val) + "\n"
return output
MoRe
  • 1,478
  • 13
  • 25
  • gives me an error 'STR' has no attribute ' reverse_complement' – Nodnin Oct 29 '12 at 00:26
  • did you notice the first 3 lines of code I posted? There is the conversion from your string to a Sequence-object. – MoRe Oct 29 '12 at 00:27
  • But i cant see , how will the sequence start with "ATG' and end with stop codon .. how can i find the multiple substrings in one reading frame – Nodnin Oct 29 '12 at 01:33
  • I improved my code explanation. Hope everything is clear now. Check out the translation table at http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG1 – MoRe Oct 29 '12 at 01:47
  • So the star needs to be the stop codons here . and we need to write a different piece of code to find the start codon ? the documents provided here are good but have very little clear explanation . I tried to make changes but got an error and most important is it necessary to import the nio module in python ... – Nodnin Oct 29 '12 at 02:29
  • I reread the NCBI translation page as well as the Biopython help and it turns out that the translate() function just translates the string. Starting codons are translated to the letter 'M' which is otherwise not present. So you would have to go over the final strings once again to cut them starting from 'M' to '*' - the only problem is that alternative starting codons are possible: TTG and CTG. You might check out the documentation to find a different translation table or create one your own if you are not comfortable with the additional starting codons. – MoRe Oct 29 '12 at 10:49
  • I implemented this code, and I would like to know how to display the entire protein sequence. Right now it looks like the code truncates it on purpose in the final print statment. – olliepower Aug 06 '13 at 15:49
  • The problem is that this become ORF in Protein translate function), isn't right ? – Lucke Dec 11 '17 at 10:37
1

Works in a similar way:

#!/usr/bin/python3

import re

pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')

Seq= """CATGCTCAGCGAGGACAGCAAGGGCCCATTTACAGGAGCATAGTAA"""

revcompseq = Seq[::-1].maketrans("ATGC", "TACG") #reverse complement

print (pattern.findall(Seq)) #forward search
print (pattern.findall(Seq[::-1].translate(revcompseq))) #backward search

It gives you the following result with this little demo sequence:

['ATGCTCAGCGAGGACAGCAAGGGCCCATTTACAGGAGCA']
['ATGCTCCTG', 'ATGGGCCCTTGCTGTCCTCGC']
Stefan Gruenwald
  • 2,582
  • 24
  • 30