Rosalind: Open Reading Frame

Question

I am working through the Rosalind problems and can't figure out what is wrong with my code. The problem is:

Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.

An open reading frame (ORF) is one which starts from the start codon and ends by stop codon, without any other stop codons in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.

Given: A DNA string s of length at most 1 kbp in FASTA format.

Return: Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.

Here is my code (Python):

    DNA_Codons = {
        'TTT': 'F',     'CTT': 'L',     'ATT': 'I',     'GTT': 'V',
        'TTC': 'F',     'CTC': 'L',     'ATC': 'I',     'GTC': 'V',
        'TTA': 'L',     'CTA': 'L',     'ATA': 'I',     'GTA': 'V',
        'TTG': 'L',     'CTG': 'L',     'ATG': 'M',     'GTG': 'V',
        'TCT': 'S',     'CCT': 'P',     'ACT': 'T',     'GCT': 'A',
        'TCC': 'S',     'CCC': 'P',     'ACC': 'T',     'GCC': 'A',
        'TCA': 'S',     'CCA': 'P',     'ACA': 'T',     'GCA': 'A',
        'TCG': 'S',     'CCG': 'P',     'ACG': 'T',     'GCG': 'A',
        'TAT': 'Y',     'CAT': 'H',     'AAT': 'N',     'GAT': 'D',
        'TAC': 'Y',     'CAC': 'H',     'AAC': 'N',     'GAC': 'D',
        'TAA': '-',     'CAA': 'Q',     'AAA': 'K',     'GAA': 'E',
        'TAG': '-',     'CAG': 'Q',     'AAG': 'K',     'GAG': 'E',
        'TGT': 'C',     'CGT': 'R',     'AGT': 'S',     'GGT': 'G',
        'TGC': 'C',     'CGC': 'R',     'AGC': 'S',     'GGC': 'G',
        'TGA': '-',     'CGA': 'R',     'AGA': 'R',     'GGA': 'G',
        'TGG': 'W',     'CGG': 'R',     'AGG': 'R',     'GGG': 'G'
    }
    bases={"A":"T",
           "T":"A",
           "G":"C",
           "C":"G"}

    def Pro(DNA, start, Rev):
            #Calculates the reverse complement if Rev is True
            if Rev == True:
                    reverse=DNA[::-1]
                    compliment=[]
                    for base in reverse:
                            compliment+=bases[base]
                    Seq="".join(compliment)
            elif Rev== False:
                    Seq=DNA
            Protein=[]
            #Finds a start codon
            for i in range(start, len(Seq),3):
                    codon=Seq[i:i+3]
                    if codon=="ATG":
                            #Starting from that start codon, returns a protein, breaks if stop codon
                            #-2 included so that it's always in blocks of 3
                            for j in range(i,len(Seq)-2,3):
                                    new_codon=Seq[j:j+3]
                                    if DNA_Codons[new_codon]!="-":
                                            Protein+=[DNA_Codons[new_codon]]
                                    else:
                                            #Adds in the '-' to split proteins that start within the same Reading Frame
                                            Protein+=[DNA_Codons[new_codon]]
                                            break
            return Protein
    f = open('rosalind_orf.txt','r').read()
    #Puts each FASTA string into a list
    strings=f.split(">")

    #removes the FASTA ID from the string in array and new line characters
    for i in range(len(strings)):
            strings[i]=strings[i].strip("Rosalind_0123456789")
            strings[i]=strings[i].replace("\n","")

    DNA=strings[1]
    #Adds proteins from all Open Reading Frames
    Proteins=[]
    for i in range(len(DNA)):
            Proteins+="".join(Pro(DNA,i,False)).split('-')
            Proteins+="".join(Pro(DNA,i,True)).split('-')
    #Makes a list of unique proteins and prints them
    Unique_Proteins=[]
    for p in Proteins:
            if (p not in Unique_Proteins and p!=""):
                    Unique_Proteins+=[p]
                    print(p)

Using the sample data:

    >Rosalind_99
    AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG

My code works fine; however, it fails on every question dataset I've been given.

Here is one of the question datasets that I've failed on:

    >Rosalind_1485
    GACCAGAATGCGTTAGTCGGCCTCAGAGCGCACAAAAACCAGTATTTACAAAGTGGGACG
    TAGCGCCCCGCGGCGTCCTTTTGCCCTATCGAAAGTATAGGCATCAGCTTTTTACCACCT
    TGTCATAGGTAAACTGCCCGACCCAGGTCCGGCCCTCAGCCCAACGCAGATAAACCAAGG
    TTATAGATGTGGCCTGTAGGCATATTGCTCTTAATGTTATAAAGAGCGAAGCGTGGTCTC
    GGTTTGTAAACATTAATCAAATTCCCAGGCACTAAGCCATGGTCGCCCCGGATTGGTTTT
    CCGGTGTACGCATCGGTGGCAGCTGGAGGGGACAGTTTAGGTGCTGCAATTGAACATGAA
    ACTGCACGAAAGGTGGGGTGGGCCGGATCTTGCGGGCCTCGAAAGGGTAGTGTTCCTCTG
    CTATCTAGTCCAATTACCTGTAGTATATATGATCAGGCCGTCGGTTACTTAGCTAAGTAA
    CCGACGGCCTGATCATCTCCTAGGAAATGGTCCTGAATGCGAACTAGGTTCCGTGGAATG
    ATGGGGCCCAGAGGAAACCTGTACGCAATGGATCCCGGACAGATAGACCGGGAGGTCTTG
    CAACCTCTTGTGGGAGTTACAGGCCGTACCTGAATTGCCCTCGTACCATTTGAAATGGTG
    CGACGCCTGTACGCAACAATCGTTCGCCTGGATAATACAGACGGCCATTTCTGTAGGAAC
    GATACCGTAACGCGACGTCAGGCATGACGTTAACTGCGTCACGTTTCATACCACTATGTG
    AGGTACCCACTCCTTCATTTACCGCGAGATAAAGAGCCACCACCACCTTCTCTTGGTTTC
    CATGCGCCGATCGGCTAAACGTGCATCACATTCAGGCGAAGAGTCAAATGGAAGCTCGCA
    ATTTTAGGCCTTTATGGCGAATATCCCGCAAGCCTTAGGCGCGT

Obviously this code is nowhere near efficient and there's a lot that could be improved; I'm just curious as to why it's not working.

1 - on rosalind_99, your code gives `['ATG', 'ATC', 'CGA', 'GTA', 'GCA', 'TCT', 'CAG']` as a legitimate protein even though there's no stop codon. Is that correct behavior? (i only know as much about this as scanning google tells me) — e.s., Apr 15 '18 at 16:33
2 - If your code is failing on Rosalind_1485, how do you know it's failing? What kind of failure message are you getting? Was the sample data also in a text file? Perhaps you are reading the text file incorrectly — e.s., Apr 15 '18 at 16:41
I tried your code with Rosalind99 and it produces two additional outputs not mentioned in the Rosalind sample output: MIRVASQ & MA. Do you experience the same? — Mr. T, May 03 '18 at 13:59
@Mr.T The Rosalind sample output is trash. In this particular case my code outputs all the sample data except for the first protein string. And I can't seem to find said string anywhere in the DNA string either. — a.anev, Oct 01 '20 at 07:07
@Mr.T I think the problem wanted a string that was terminated by a stop codon. — Nosey, Apr 14 '22 at 11:38
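
Following up on the comments above: one likely cause (I can't confirm it is the only one) is that the posted code keeps strings from reading frames that run off the end of the sequence without ever reaching a stop codon, such as MIRVASQ and MA on the sample data. The problem statement only counts a candidate protein when the ORF ends at a stop codon. Below is a minimal Python 3 sketch of that rule. It assumes the input holds a single FASTA record containing only uppercase A, C, G and T; the names `parse_fasta`, `reverse_complement` and `candidate_proteins` are illustrative and not from the original post.

```python
# Standard genetic code built from the usual TCAG ordering; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def parse_fasta(text):
    """Return the sequence of the first FASTA record as one string."""
    sequence = []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if sequence:      # a second record starts: stop reading
                break
            continue          # skip the header line
        sequence.append(line.strip())
    return "".join(sequence)


def reverse_complement(dna):
    """Reverse the strand and swap A<->T and C<->G."""
    return dna[::-1].translate(str.maketrans("ACGT", "TGCA"))


def candidate_proteins(dna):
    """Collect distinct proteins from ORFs that start at ATG and reach a stop codon."""
    proteins = set()
    for strand in (dna, reverse_complement(dna)):
        # Trying every position as a start covers all three frames of the strand.
        for start in range(len(strand) - 2):
            if strand[start:start + 3] != "ATG":
                continue
            protein = []
            for j in range(start, len(strand) - 2, 3):
                amino = CODON_TABLE[strand[j:j + 3]]
                if amino == "*":
                    # Only a frame closed by a stop codon yields a candidate.
                    proteins.add("".join(protein))
                    break
                protein.append(amino)
            # If the loop ends without hitting a stop, the partial protein is dropped.
    return proteins


sample = """>Rosalind_99
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
"""
for protein in sorted(candidate_proteins(parse_fasta(sample))):
    print(protein)
```

On the sample record this prints M, MGMTPRLGLESLLE, MLLGSFRLIPKETLIQVAGSSPCNLS and MTPRLGLESLLE, and no longer prints MIRVASQ or MA. The string MLLGSFRLIPKETLIQVAGSSPCNLS comes from the reverse complement, which is why it cannot be found by searching the forward strand. To run it on a downloaded dataset, replace `sample` with the contents of the file, for example `open('rosalind_orf.txt').read()`.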
