I want to look for open reading frames (ORFs) in a set of large sequences. For this I use the ORF_finder function from Biopython. This works well: I can print the nucleotide sequences that contain an ORF longer than a certain size, and I can also print the protein sequences.
The script looks like this:
from Bio import SeqIO

def ORF_Finder(fasta_file, min_length=0, por_n=100):
    # outfile is assumed to be an open, writable file handle defined elsewhere
    table = 11  # NCBI translation table 11
    min_pro_len = 1000
    min_pro_len2 = 400
    test = 'ORF'
    for record in SeqIO.parse(fasta_file, "fasta"):
        print(record)
        min_pro_len = 100  # overrides the value set above
        for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
            for frame in range(3):
                length = 3 * ((len(record) - frame) // 3)  # Multiple of three
                for pro in nuc[frame:frame + length].translate(table).split("*"):
                    if len(nuc) >= 4000:
                        if len(pro) >= min_pro_len:
                            outfile.write('>' + str(record.id) + '\n' + str(pro) + '\n')
                            print("%s...%s - length %i, strand %i, frame %i"
                                  % (pro[:30], pro[-3:], len(pro), strand, frame))
If I print record.seq I get the entire sequence, but what I want is the nucleotide sequence that encodes each particular protein.
How can I get these sequences?
Best regards,
Bas
To clarify, I use a nucleotide sequence as input, e.g.:
TAATAATAGTAGTAATAGATGATGATGATGATGCGACGACGA
Then I run the ORF finder script, which gives me the following amino acid sequence:
MMMMMRRR
But I'm not interested in the amino acid sequence; I want the nucleotide sequence that codes for it, e.g.:
ATGATGATGATGATGCGACGACGA
And I don't know how to get this sequence out.
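One way to get at the nucleotides, sketched under some assumptions: instead of translating first and splitting on "*", walk each frame codon by codon and remember where the current stretch started, so the matching slice of the strand can be cut out directly. The helper names orf_nucleotides and reverse_complement are made up for illustration, the code works on plain strings rather than Seq objects (str(record.seq) could be passed in), and it assumes the stop codons are TAA, TAG and TGA, as in translation table 11. Like split("*"), it reports stop-to-stop stretches, not strictly ATG-to-stop ORFs.

```python
# Stop codons assumed here: TAA, TAG, TGA (the stops used by NCBI table 11).
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a plain DNA string (other symbols pass through unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]


def orf_nucleotides(seq, min_pro_len=1):
    """Yield (strand, frame, start, end, nt) for every stop-to-stop stretch
    that codes for at least min_pro_len amino acids.

    start and end are 0-based offsets into the strand being read, so for
    strand -1 they refer to the reverse complement, not to the input string.
    """
    seq = seq.upper()
    for strand, nuc in [(+1, seq), (-1, reverse_complement(seq))]:
        for frame in range(3):
            start = frame  # first base of the current stretch
            last = frame + 3 * ((len(nuc) - frame) // 3)  # end of the last full codon
            for pos in range(frame, last, 3):
                if nuc[pos:pos + 3] in STOPS:
                    if (pos - start) // 3 >= min_pro_len:
                        yield strand, frame, start, pos, nuc[start:pos]
                    start = pos + 3  # the next stretch begins after the stop codon
            # A stretch that runs to the end of the sequence without hitting a stop
            if (last - start) // 3 >= min_pro_len:
                yield strand, frame, start, last, nuc[start:last]


example = "TAATAATAGTAGTAATAGATGATGATGATGATGCGACGACGA"
for strand, frame, start, end, nt in orf_nucleotides(example, min_pro_len=8):
    print(strand, frame, start, end, nt)
```

For the example input, the stretches reported include ATGATGATGATGATGCGACGACGA on strand +1, frame 0, at offsets 18 to 42, which is the sequence coding for MMMMMRRR. Other frames may report their own stretches as well, since the short example has few stop codons outside frame 0.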