-1

I need to convert contigs into their respective protein sequences given a reference genome (i.e. I need to take a substring, whose position is already known on the string, and I need to locate the nearest start and stop codons - a specific 3 letter sequence as written in the code).

This is tricky because sometimes the first position of contig will not be a multiple of three (i.e. the first three nucleotides in the contig may not perfectly match a codon). Also, sometimes the contigs could be located in the intergenic region (i.e. between genes). The goal is to separate both coding and non-coding DNA.

This is my code so far:

from Bio.Seq import Seq
from Bio.Alphabet import generic_dna, generic_protein

start_codons = ['ATG']
stop_codons = ['TAG', 'TAA', 'TGA']

string = 'GG*TAG*CCAATT*ATG*AACGAA*TAG*GAC' #remove '*', just for visual
contigs = ['CCAA', 'TGAAC', 'GAA', 'GGAC']
positions = [5, 12, 17, 22] #position indices for each contig on string

extended_contigs = []
extended_position_contigs = []
intergenic_contigs = []
intergenic_position_contigs = []
for i in contigs:
    extended_contigs.append(#some code)
    extended_positions_contigs.append(#some code)
    intergenic_contigs.append(#some code)
    intergenic_positions_contigs.append(#some code)

I should get extended_contigs = ['ATGAAC', 'ATGAACGAA'] and extended_positions_contigs = [12, 17]. These are the contigs that are located within genes. In order to code them into peptides, I need to reach back in the string until I find the start codon and extend the initial contig (e.g. TGAAC -> ATGAAC and GAA -> ATGAACGAA)

I should also get intergenic_contigs = ['CCAA', 'GGAC'] and intergenic_positions_contigs = [5, 22]. When the missing code is run, the computer searches to the left of the string and finds a stop codon (e.g. TAG) before the start codon. Thus, the contig is located in between two genes and nothing needs to be added. These intergenic contigs are just stored in a new list.

My code continues:

prot_contigs = []
for i in extended_contigs:
    my_dna = Seq(i, generic_dna)
    my_prot = my_dna.translate()
    prot_contigs.append(str(my_prot))

Here, no new code needs to be added. After the above is run, prot_contigs = ['MN', 'MNE'].

The last step of code (which I need help with) will transform prot_contigs into new_prot_contigs = ['MN', 'E'].

How? If for any contig (e.g. 'TGAAC'), the beginning or end is part of another codon (not perfect multiple of 3), the extra codon on either end will be maintained (e.g. 'MN' stays 'MN'). Otherwise, if the contig (e.g. 'GAA') perfectly matches a codon, anything that was added onto it will be removed (e.g. 'MNE' becomes 'E').

I would try to solve the 2 portions of code myself, but I am unsure on how to take a position on a string (i.e. beginning points of contig) and look along the string to find the nearest start/stop codons, so I could determine the DNA's function and accurately sequence the protein coding contigs into peptides.

Any help would be appreciated!

Thomas
  • 25
  • 3
  • I was thinking of using the 'in' operator for the first part of the code (e.g. if 'ATG' in string: ...) however, I would need to do this locally while searching to the left of a given starting position (i.e. beginning of the contigs) along the string. Any ideas? – Thomas Oct 21 '18 at 04:57
  • Is it possible that something else is in between start_codons stop_codons and contigs? (ie TGAAC -> ATGAAC and GAA -> ATGAACXXXXYYYYGAA) – Louis Ng Oct 21 '18 at 05:40
  • Read [str.find](https://docs.python.org/3/library/stdtypes.html#str.find) – stovfl Oct 21 '18 at 07:06
  • 1
    This looks like a crosspost with your SE Bioinformatics question: https://bioinformatics.stackexchange.com/questions/5235/searching-for-start-and-stop-codons-for-protein-sequencing-of-contigs – conchoecia Oct 22 '18 at 20:49

1 Answers1

0

This answer seemed to have been answered before

How to find a open reading frame in Python

But you really don't need to do this yourself (unless you really want to). There are plenty of tools that can do this easily. An example from the EMBOSS Suite.

getorf -find 3 genome.fna genome.orf

I guess it will be more tricky to do this if you are on a Windows system, but consider doing it in a virtualbox environment. Most bioinformatics is done on unix systems nowadays.

Hielke Walinga
  • 2,677
  • 1
  • 17
  • 30