Finding open reading frames in Python

Question

I want to look for open reading frames (ORFs) in a set of large sequences. For this I use an ORF_Finder function based on Biopython. This works well: I can print the nucleotide sequences that contain an ORF longer than a certain size, and I can also print the protein sequences.

The script looks like this:

from Bio import SeqIO

def ORF_Finder(fasta_file, min_length=0, por_n=100):
    # NOTE: outfile is assumed to be an open, writable file handle defined elsewhere in the script.
    table = 11         # NCBI translation table 11 (bacterial, archaeal and plant plastid)
    min_pro_len = 100  # minimum protein length, in amino acids
    for record in SeqIO.parse(fasta_file, "fasta"):
        print(record)
        for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
            for frame in range(3):
                length = 3 * ((len(record) - frame) // 3)  # Multiple of three
                # Translate the whole frame, then split the protein at stop codons ("*")
                for pro in nuc[frame:frame + length].translate(table).split("*"):
                    if len(nuc) >= 4000:
                        if len(pro) >= min_pro_len:
                            outfile.write('>' + str(record.id) + '\n' + str(pro) + '\n')
                            print("%s...%s - length %i, strand %i, frame %i"
                                  % (pro[:30], pro[-3:], len(pro), strand, frame))

If I print record.seq I get the entire sequence, but what I want is the nucleotide sequence of this particular protein.

How can I get these sequences?

Best regards,

Bas

To clarify, I use a nucleotide sequence as input, e.g.:

TAATAATAGTAGTAATAGATGATGATGATGATGCGACGACGA

Then I run the ORF finder script, which gives me the following amino acid sequence:

MMMMMRRR

But I'm not interested in the amino acid sequence; I want the nucleotide sequence that codes for it, e.g.:

ATGATGATGATGATGCGACGACGA

And I don't know how to get this sequence out.
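
The mapping asked for here can be sketched in plain Python, without Biopython: instead of splitting the translated protein at "*", walk each frame codon by codon and remember the nucleotide index where the current stop-free stretch began. This is only a minimal sketch under some assumptions. The names `STOP_CODONS`, `reverse_complement` and `orf_nucleotides` are hypothetical helpers, the stop codons are those of translation table 11, and, like the script above, it treats any stretch between stop codons as an ORF without requiring a start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of NCBI translation table 11


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T/N string (hypothetical helper)."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def orf_nucleotides(seq, min_codons=1):
    """Yield (strand, frame, start, end, nt_seq) for every stop-free stretch.

    start and end are 0-based, end-exclusive indices into the strand that was
    scanned; for strand -1 they refer to the reverse complement, not the input.
    """
    seq = seq.upper()
    for strand, nuc in ((+1, seq), (-1, reverse_complement(seq))):
        for frame in range(3):
            start = frame  # nucleotide index where the current stretch begins
            for pos in range(frame, len(nuc) - 2, 3):
                if nuc[pos:pos + 3] in STOP_CODONS:
                    if (pos - start) // 3 >= min_codons:
                        yield strand, frame, start, pos, nuc[start:pos]
                    start = pos + 3  # next stretch starts after the stop codon
            # Trailing stretch that runs to the end of the frame without a stop
            end = frame + 3 * ((len(nuc) - frame) // 3)
            if (end - start) // 3 >= min_codons:
                yield strand, frame, start, end, nuc[start:end]


example = "TAATAATAGTAGTAATAGATGATGATGATGATGCGACGACGA"
for strand, frame, start, end, nt_seq in orf_nucleotides(example, min_codons=8):
    print(strand, frame, start, end, nt_seq)
```

For the example input, the forward strand in frame 0 gives nucleotides 18 to 42, which is ATGATGATGATGATGCGACGACGA. Other frames and the reverse strand can report their own stretches, so the results may need filtering. The same bookkeeping should carry over to the Biopython version: keep a running amino acid offset while looping over the split protein pieces and multiply it by three to get back to nucleotide coordinates.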

Hey, sorry for my bad biology here. Can you print a fragment of what your program outputs to the file, and also what you want to get printed (for one record)? I haven't used Biopython, but maybe I can help you. - avenet, Jan 12 '15 at 13:19
So what I have is a nucleotide sequence, e.g. TATTAGTAGCTATAGTAGCTAGATGATGATGATG. This is translated into amino acids, e.g. ISSYSS*STOP*MMMM. This program selects the MMMM, but what I actually want is the corresponding nucleotide sequence, e.g. ATGATGATGATG. - Bas, Jan 12 '15 at 13:39
I don't get it. You want the nucleotide sequence corresponding to the protein you just found with your ORF_Finder() function? In general, you want to provide an example of the input, the output you get, and the output you expect instead. - jrjc, Jan 12 '15 at 18:54
Indeed, that is what I want. I edited my question and included an example. - Bas, Jan 12 '15 at 19:22
Is this homework? Or do you really want to find ORFs? Because there are programs that do that pretty well. - jrjc, Jan 13 '15 at 08:18
It's not homework, so if there's another program that can do this, that's also fine by me. - Bas, Jan 13 '15 at 08:34
So you should look at programs like Prodigal or Glimmer. And if you really want to annotate ORFs, the rules you are using aren't nearly enough; you'll get too many false positives (that's why you should use the aforementioned software). - jrjc, Jan 13 '15 at 13:57
If you want to check with your own Python code, look at my answer here: http://stackoverflow.com/questions/13114246/how-to-find-a-open-reading-frame-in-python?rq=1 - Stefan Gruenwald, Aug 03 '15 at 21:07
