7

I am a biology graduate student and I taught myself a very limited amount of python in the past few months to deal with some data I have. I am not asking for homework help, this is for a research project.

With this code I intend to take a portion of a string called sequence, between: find the start site of "protein translation," or the first occurrence of ATG (biological term is start codon), then the first occurrence of TAA (stop codon).

Then the function translate_dna() should, for every three letters in the string, swap for the dictionary value. The variable CDS exists properly, but for, or if loop in my function is not working :(. Any suggestions? The input file is formatted as follows:

>gnl|GNOMON|230560476.m Model predicted by Gnomon on Homo sapiens unplaced genomic scaffold, alternate assembly HuRef DEGEN_1103279082069, whole genome shotgun sequence (NW_001841731.1)
CCCCAGTAGCTGGGATTACAGGTTATCCAAGGACATGGAAAAGCCAACACCATGGTAGCATTAATGAAAG
TTTACCAAGAGGAAGATGAAGCCTACCAGGAATTAGTTACCATGGCAACCATGTTTTTCCAGTACTTACT
GCAGCCATTTAGGGCTATGCGAGAAGTTGCAACTTTATGTAAGCTTGAT

>gnl|GNOMON|230560472.m Model predicted by Gnomon on Homo sapiens unplaced genomic scaffold, alternate assembly HuRef DEGEN_1103279082069, whole genome shotgun sequence (NW_001841731.1)
GCCGGCGTTTGACCGCGCTTGGGTGGCCTGGGACCCTGTGGGAGGCTTCCCCGGCGCCGAGAGCCCTGGC
TGACGGCTGATGGGGAGGAGCCGGCGGGCGGAGAAGGCCACGGGCTCCCCAGTACCCTCACCTGCGCGGG
ATCGCTGCGGGAAACCAGGGGGAGCTTCGGCAGGGCCTGCAGAGAGGACAAGCGAAGTTAAGAGCCTAGT
GTACTTGCCGCTGGGAGCTGGGCTAGGCCCCCAACCTTTGCCCTGAAGATGCTGGCAGAGCAGGATGTTG
TAACGGGAAATGTCAGAAATACTGCAAGCAAACTGAAAACAACCCATCCATGTAGGAAAGAATAACACGG
ACTACACACTATGAGGAAACCACAGGGGAGTTTCAGGCCAGTCAGCTTTTGATCTTCAACTTTATAACTT
TCACCTTAGGATATGACGAGCCCACCGGAGTTTCAAAAATGGTATCATTTTGTATCAGGCTTGTTTTTTA
CACTCTTGGTTTCTCACAGAGATAGGTGGTTTCTCCTTAAAATCGAACATTTATATGATGCATTTTACTG
TAGTTACTATCAGAAAAGTTAGTTTTCCCAAATTTAAGTTCACTCTGGGGTACTATAGCGTGAATGTAGT
TCATTCTGTTGAGCTAGTTGTTCATGTTAGTGTAGTTCACATATTTATCTGGAACTCAAAAATGAGGGGT
TGAGAGGGGAAGCTAAAATTCAAAACATGTCCAAATATATAATTTTAATATTTTACTTTATATTTAAAAT
AGAAAAGCAATTGATTCTAGAATTAGACTAATTGCTAGCATTGCTAGGATATATAAAATGAAGCTGAATG
TTTTAACTCTGGAATTTTTCTGAATAGTCTAAGAAATAAGGCTGAAGTGTATCACTTGCCTTAAGTTTAC
TTTTGCGTGTGTGTTTTAATTTTGTTCAGTGGGGCTTTCACTTAAAAAAAAAACCATAATATTATTACCT
GGATAAAAAATACAGCTGAAAGTAGATCACTTTATCTTTAAGCAGAAGGATGGAAATAGAAGAATTTTAA
GAATGTATTGGTTGAAAAACATCTATATTATTTTATTTTTATTTCTCTTCTTGTGGGAGTAAAATAATTT
CCAACCAAATCAGTCCACCTAGATTATACACTGTTCAGTTTGTTTTCTGCCCTGCAGCACAAGCAATAAC
CAGCAGAGACTGGAACCACAGCTGAGGCTCTGTAAATGAGTTGACTGCTAAGGACTTCATGGGGATATTA
ACCTGGGGCATTAAGAGAATCAACATGCTAAAGTACTTGGAGACAGCTCTGTAATGTTTTATGAGGTTTT
TTGTTTTTTTTTTTTGAGACAGAGTCTTGCACTGTCGCCCAGGCTGG

Code:

from sys import argv
script, filename = argv

def translate_dna(sequence):

    codontable = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }
    proteinsequence = ''
    start = sequence.find('ATG')
    sequencestart = sequence[int(start):]
    stop = sequencestart.find('TAA')
    cds = str(sequencestart[:int(stop)+3])

    for n in range(0,len(cds),3):
        if cds[n:n+3] in codontable == True:
            proteinsequence += codontable[cds[n:n+3]]
            print proteinsequence
        sequence = ''


header = ''
sequence = ''
for line in open(filename):
    if line[0] == ">":
        if header != '':
            print header
            translate_dna(sequence)

        header = line.strip()
        sequence = ''
    else:
        sequence += line.strip()

print header 
translate_dna(sequence)
zero323
  • 322,348
  • 103
  • 959
  • 935
  • If this not a homework why don't you use for example [Biopyton](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc25)? – zero323 Oct 22 '13 at 15:27
  • Someone has pointed this out to me and I actually started there. I don't believe Biopython limits to the first in frame start and stop codon. This is okay if you are unsure of the RNA data you have, but I know I have mRNA which have a definitive singular start and stop site, so when I get results with multiple start and stop Amino Acids in them, something is wrong – Karl Eric Swanson Oct 22 '13 at 15:43
  • If this assumption about biopython is incorrect, then I must have read the cookbook wrong and even someone highlighting my reading error would help. I'm not opposed to any solution here. – Karl Eric Swanson Oct 22 '13 at 15:44
  • I haven’t use Biopython for a while but if I remember correctly it should suit your needs. – zero323 Oct 22 '13 at 16:29
  • as far as I know there is no U in codon table and it's T instead. And T stands for Threonine and that makes sense. – yukashima huksay Oct 16 '17 at 14:02

2 Answers2

8

Your problem stems from the line

if cds[n:n+3] in codontable == True

This always evaluates to False, and thus you never append to proteinsequence. Just remove the == True portion like so

if cds[n:n+3] in codontable

and you will get the protein sequence. Also, make sure to return proteinsequence in translate_dna().

mdml
  • 22,442
  • 8
  • 58
  • 66
2

There is one more problem in your code - when you use stop = sequencestart.find('TAA') you don't care about opened reading frame. In code below I split sequence into triplets and use itertools.takewhile to handle that but it can be done using loops as well:

from itertools import takewhile

def translate_dna(sequence, codontable, stop_codons = ('TAA', 'TGA', 'TAG')):       
    start = sequence.find('ATG')

    # Take sequence from the first start codon
    trimmed_sequence = sequence[start:]

    # Split it into triplets
    codons = [trimmed_sequence[i:i+3] for i in range(0, len(trimmed_sequence), 3)]
    print(len(codons))
    print(trimmed_sequence)
    print(codons)

    # Take all codons until first stop codon
    coding_sequence  =  takewhile(lambda x: x not in stop_codons and len(x) == 3 , codons)

    # Translate and join into string
    protein_sequence = ''.join([codontable[codon] for codon in coding_sequence])

    # This line assumes there is always stop codon in the sequence
    return "{0}_".format(protein_sequence)
zero323
  • 322,348
  • 103
  • 959
  • 935
  • Thank you as well. This works perfectly and I'm no authority, but from what I've seen, seems more pythonic than my original code. – Karl Eric Swanson Oct 22 '13 at 16:15
  • Oh also, after looking at the results, this is close, but I think the codons = [trimmed_sequence[i:i+3] for i in range(len(trimmed_sequence)/3)] should actually look like codons = [trimmed_sequence[i:i+3] for i in range(0, len(trimmed_sequence), 3)] – Karl Eric Swanson Oct 22 '13 at 22:05
  • Otherwise it reads over every three characters in the string and moves by +1 from the original start. That is ATGG yields MW from ATG, 0-2 characters and TGG, 1-3 characters and so on, for longer sequences. – Karl Eric Swanson Oct 22 '13 at 22:09