I am working through the Rosalind problems and I am stuck on what is wrong with my code. The problem is:
Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.
An open reading frame (ORF) is one which starts from the start codon and ends by stop codon, without any other stop codons in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.
Given: A DNA string s of length at most 1 kbp in FASTA format.
Return: Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.
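To make the definitions above concrete, the following is a minimal sketch of one way to read them. It is not the code in question and not an official Rosalind solution. It assumes the standard genetic code, treats ATG as the only start codon, and discards any frame that runs off the end of a strand without reaching a stop codon. The names `CODON_TABLE`, `reverse_complement`, and `candidate_proteins` are illustrative only.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; '*' marks a stop codon.
# (Assumption: the standard table is the one the problem intends.)
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def candidate_proteins(dna):
    """Return the set of distinct proteins translated from ORFs of dna.

    Every ATG on either strand is tried as a start. A protein is kept
    only if an in-frame stop codon is actually reached.
    """
    proteins = set()
    for strand in (dna, reverse_complement(dna)):
        for start in range(len(strand) - 2):
            if strand[start:start + 3] != "ATG":
                continue
            protein = []
            for pos in range(start, len(strand) - 2, 3):
                amino = CODON_TABLE[strand[pos:pos + 3]]
                if amino == "*":
                    # Stop codon reached: this is a complete ORF.
                    proteins.add("".join(protein))
                    break
                protein.append(amino)
            # If the loop ends without hitting a stop, nothing is added.
    return proteins


SAMPLE = ("AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT"
          "TTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG")
print(sorted(candidate_proteins(SAMPLE)))
```

Scanning every start position on both strands covers all six reading frames without tracking the frame explicitly, and using a set takes care of the "distinct" requirement.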
Here is my code (Python):
DNA_Codons = {
    'TTT': 'F', 'CTT': 'L', 'ATT': 'I', 'GTT': 'V',
    'TTC': 'F', 'CTC': 'L', 'ATC': 'I', 'GTC': 'V',
    'TTA': 'L', 'CTA': 'L', 'ATA': 'I', 'GTA': 'V',
    'TTG': 'L', 'CTG': 'L', 'ATG': 'M', 'GTG': 'V',
    'TCT': 'S', 'CCT': 'P', 'ACT': 'T', 'GCT': 'A',
    'TCC': 'S', 'CCC': 'P', 'ACC': 'T', 'GCC': 'A',
    'TCA': 'S', 'CCA': 'P', 'ACA': 'T', 'GCA': 'A',
    'TCG': 'S', 'CCG': 'P', 'ACG': 'T', 'GCG': 'A',
    'TAT': 'Y', 'CAT': 'H', 'AAT': 'N', 'GAT': 'D',
    'TAC': 'Y', 'CAC': 'H', 'AAC': 'N', 'GAC': 'D',
    'TAA': '-', 'CAA': 'Q', 'AAA': 'K', 'GAA': 'E',
    'TAG': '-', 'CAG': 'Q', 'AAG': 'K', 'GAG': 'E',
    'TGT': 'C', 'CGT': 'R', 'AGT': 'S', 'GGT': 'G',
    'TGC': 'C', 'CGC': 'R', 'AGC': 'S', 'GGC': 'G',
    'TGA': '-', 'CGA': 'R', 'AGA': 'R', 'GGA': 'G',
    'TGG': 'W', 'CGG': 'R', 'AGG': 'R', 'GGG': 'G'
}
bases={"A":"T",
       "T":"A",
       "G":"C",
       "C":"G"}
def Pro(DNA, start, Rev):
    #Calculates the reverse complement if Rev is True
    if Rev == True:
        reverse=DNA[::-1]
        compliment=[]
        for base in reverse:
            compliment+=bases[base]
        Seq="".join(compliment)
    elif Rev== False:
        Seq=DNA
    Protein=[]
    #Finds a start codon
    for i in range(start, len(Seq),3):
        codon=Seq[i:i+3]
        if codon=="ATG":
            #Starting from that start codon, builds a protein, breaks at a stop codon
            #-2 included so that it's always in blocks of 3
            for j in range(i,len(Seq)-2,3):
                new_codon=Seq[j:j+3]
                if DNA_Codons[new_codon]!="-":
                    Protein+=[DNA_Codons[new_codon]]
                else:
                    #Adds in the '-' to split proteins that start within the same reading frame
                    Protein+=[DNA_Codons[new_codon]]
                    break
    return Protein
f = open('rosalind_orf.txt','r').read()
#Puts each FASTA string into an array
strings=f.split(">")
#Removes the FASTA ID and newline characters from each string in the array
for i in range(len(strings)):
    strings[i]=strings[i].strip("Rosalind_0123456789")
    strings[i]=strings[i].replace("\n","")
DNA=strings[1]
#Adds proteins from all open reading frames
Proteins=[]
for i in range(len(DNA)):
    Proteins+="".join(Pro(DNA,i,False)).split('-')
    Proteins+="".join(Pro(DNA,i,True)).split('-')
#Makes a list of unique proteins and prints them
Unique_Proteins=[]
for p in Proteins:
    if (p not in Unique_Proteins and p!=""):
        Unique_Proteins+=[p]
        print(p)
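The header-stripping step above depends on `str.strip` removing a set of characters rather than a prefix, which works for IDs of the form `Rosalind_1234` followed by a newline but is easy to break. As a hedged alternative, here is a minimal sketch that splits the text on header lines instead. `read_fasta` is a hypothetical helper, not part of the code in question, and it assumes every record starts with a `>` line.

```python
def read_fasta(text):
    """Return a dict mapping each FASTA ID to its sequence string."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: everything after '>' is used as the record ID.
            current_id = line[1:].strip()
            records[current_id] = []
        elif current_id is not None:
            # Sequence line: may be one of several wrapped lines.
            records[current_id].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = ">Rosalind_99\nAGCCATGTAG\nCTAACTCAGG\n"
print(read_fasta(example))
```

For a single-record file, the sequence could then be taken as `list(read_fasta(f).values())[0]`.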
Using the sample data:
>Rosalind_99
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
My code works fine. However, it fails on every question dataset I have been given.
Here is one of the question datasets that I've failed on:
>Rosalind_1485
GACCAGAATGCGTTAGTCGGCCTCAGAGCGCACAAAAACCAGTATTTACAAAGTGGGACG
TAGCGCCCCGCGGCGTCCTTTTGCCCTATCGAAAGTATAGGCATCAGCTTTTTACCACCT
TGTCATAGGTAAACTGCCCGACCCAGGTCCGGCCCTCAGCCCAACGCAGATAAACCAAGG
TTATAGATGTGGCCTGTAGGCATATTGCTCTTAATGTTATAAAGAGCGAAGCGTGGTCTC
GGTTTGTAAACATTAATCAAATTCCCAGGCACTAAGCCATGGTCGCCCCGGATTGGTTTT
CCGGTGTACGCATCGGTGGCAGCTGGAGGGGACAGTTTAGGTGCTGCAATTGAACATGAA
ACTGCACGAAAGGTGGGGTGGGCCGGATCTTGCGGGCCTCGAAAGGGTAGTGTTCCTCTG
CTATCTAGTCCAATTACCTGTAGTATATATGATCAGGCCGTCGGTTACTTAGCTAAGTAA
CCGACGGCCTGATCATCTCCTAGGAAATGGTCCTGAATGCGAACTAGGTTCCGTGGAATG
ATGGGGCCCAGAGGAAACCTGTACGCAATGGATCCCGGACAGATAGACCGGGAGGTCTTG
CAACCTCTTGTGGGAGTTACAGGCCGTACCTGAATTGCCCTCGTACCATTTGAAATGGTG
CGACGCCTGTACGCAACAATCGTTCGCCTGGATAATACAGACGGCCATTTCTGTAGGAAC
GATACCGTAACGCGACGTCAGGCATGACGTTAACTGCGTCACGTTTCATACCACTATGTG
AGGTACCCACTCCTTCATTTACCGCGAGATAAAGAGCCACCACCACCTTCTCTTGGTTTC
CATGCGCCGATCGGCTAAACGTGCATCACATTCAGGCGAAGAGTCAAATGGAAGCTCGCA
ATTTTAGGCCTTTATGGCGAATATCCCGCAAGCCTTAGGCGCGT
Obviously this code is nowhere near efficient and there is a lot that could be improved; I'm just curious as to why it's not working.