Python code to find coding DNA with start and stop codons

Question

So recently I've been trying to write a program that detects and cuts out the coding part of a DNA sequence based on start and stop codons.

The eventual goal is to compare 2 sequences of both 240 nucleotides long, however one causes sickle-cell disease, so you want to see the difference between the coding parts that both result.

This is the code I've written so far which actually works with the sequence that comes with it.

sequence = "CCATGCTTGATCA"
sequence_list = list(sequence)
codon_list = ["ATG", "TAA", "TAG", "TGA"]
position_list = []

length_sequence = len(sequence)
length_codon = len(codon_list)
length_position = len(position_list)

n = length_sequence-1
while n >= 0:

    for i in range(0, len(codon_list)):
        codon_sub_list = list(codon_list[i])

        if sequence_list[n] == codon_sub_list[2] and sequence_list[n-1] == codon_sub_list[1] and sequence_list[n-2] == codon_sub_list[0]:
            position_list.append(n-2)
            print(sequence_list[n], "@", n)
            print(sequence_list[n-1], "@", n-1)
            print(sequence_list[n-2], "@", n-2)  
    
    n-=1

print(len(position_list))
print(sequence[position_list[length_position-1]:(position_list[0]+3)])

Now, the results of that when I made it two days ago were very promising. It gave the following results, as expected:

A at location 9

G at location 8

T at location 7

G at location 4

T at location 3

A at location 2

[7,2]

ATGCTTGA

However, today I tried continuing on the work by trying a different sequence, this time one of the 240 nucleotides long sequences. Below are the two sequences and which one it is.

Sickle-cell disease sequence:GAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCT

Normal sequence: GAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCT

This however is the results I get from executing it, I will quickly list the nucleotides and their location as most of them are irrelevant and it's mostly the last ones that matter.

[G,211] [T,210] [A,209] [G,199] [A,198] [T,197] [A,187] [A,186] [T,185] [A,145] [G,144] [T,143] [A,133] [G,132] [T,131] [G,132] [T,131] [A,130] [A,123] [G,122] [T,121] [A,78] [G,77] [T,76] [G,68] [T,67] [A,66] [G,47] [A,46] [T,45] [A,29] [G,28] [T,27] [A,1] [G,0] [T,-1]

[209, 197, 185, 143, 131, 130, 121, 76, 66, 45, 27, -1]

No sequence, just an empty line

Now clearly the problem arises when at the last codon, TGA, it notes the location of the T as -1, I however have no clue what causes this and have tried adjusting several values to somehow get it working, which it didn't do in any of the cases.

I'm wondering what can be causing this and what to do about it ? Also, I made this two days ago mostly as a draft to get started with and there are likely other things that can be better, so excuses if anything seems somewhat sloppy, in my opinion the whole while-loop could be done better but at that moment I chose it because a different approach to loop through didn't work, can't really remember what exactly it was anymore.

Note: I made a screenshot of my IDLE output to give you an idea:

Please fix your indentation. Your code is not runnable as-is. — MattDMo, Nov 30 '15 at 22:31
Also check for typos. This line is wrong: `codon_sub_list = list(codon(list[i])`. Possibly `codon_sub_list = list(codon_list[i])`? — Peter DeGlopper, Nov 30 '15 at 22:35
Also, note that negative indices are indexing the sequence from right to left, i.e. sequence[-1] will be the last element of the sequence. Can counting subsequences help anyhow? BTW, Python has few good genetics libraries, so check up on them on pypi.python.org and Google. — Dalen, Nov 30 '15 at 23:02
@MattDMo I have replaced the code and it should be working now, excuse me for the inconvenience earlier on. — KRAD, Dec 01 '15 at 08:50
@Dalen I suppose that indexing from the left to the right would be a better way then as you won't get any negative indeces ? I've heard from some, yes, but I wanted to get it working without using any at first. — KRAD, Dec 01 '15 at 08:55
Then take a look at this Q: http://stackoverflow.com/questions/33510452/python-dict-count/33510804 — Dalen, Dec 01 '15 at 13:42
Unfortunately Kate wrote it extremely bad. Most of the Q is in comments, of which many are removed now. You will have to break your sequence to a list before passing it to algo I proposed there. By using it you may parallelly compare more than one genome sequence of same length, get positions at which each of the sequence appears and their total count if you wish. Perhaps you can modify it somewhat to suit your needs. — Dalen, Dec 01 '15 at 13:51

score 1 · Accepted Answer · answered Dec 01 '15 at 09:17

I found your original code a bit unclear (starting at right edge with searching is counterintuitive to me), so I tried to code an alternative. The most important changes are that I now traverse the sequence left-to-right, and search for the codons by comparing them to subsequences in one go, not nucleotide-by-nucleotide. Here is my code, with some hopefully helpful comments. Does this do what you need? If not, let me know.

sequence = "GAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCT"
codon_list = ["ATG", "TAA", "TAG", "TGA"]

# store the starting positions of the codons
found_codon_positions = []

# note that we can use len() and [] with strings as well, no need to
# convert to list first
n = len(sequence)
k = 0
while k < n-2:
    # extract a three-nucleotide subsequence
    possible_codon = sequence[k:k+3]
    if possible_codon in codon_list:
        found_codon_positions.append(k)
    k += 1

print('found codons at indices {}'.format(found_codon_positions))

print('extracted sequence:')
print(sequence[found_codon_positions[0]:found_codon_positions[-1]+3])

Output:

found codons at indices [27, 45, 66, 76, 121, 130, 131, 143, 185, 197, 209]
extracted sequence:
TGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATG

Thank you very much for your reply ! Your code worked well ! The whole reading from right to left was something that I tried out because the other direction resulted in list out of index errors, but I understand that it must've been really confusing, this direction is also much more logical. I'm also wondering: if I want the sequence to start with a start codon (ATG) and end with a stop codon (TAA, TAG, TGA), would the best thing to do be an if-statement that checks whether the first and last indices are one of those and then remove them from the list if they aren't to narrow it down ? — KRAD, Dec 01 '15 at 18:17

Padraic Cunningham · Answer 2 · 2015-12-01T10:45:55.880

Not sure I follow all your logic but if you want to store all the indexes and find the start first and last matching subsequences:

def find_seq(s, cd):
    od = dict((s, []) for s in codon_list)
    mn, mx = None, None
    for n in range(len(s) - 1):
        seq = sequence[n:n + 3]
        if seq in od:
            od[seq].append((seq, n))
            if  mn is None:
                mn = n
            mx = n + 3
    return mn, mx, od

The dict will have all the subsequences found and the index of where each subsequence starts:

In [54]: sequence = "CCATGCTTGATCA"

In [55]: codon_list = ["ATG", "TAA", "TAG", "TGA"]

In [56]: mn,mx,od = find_seq(sequence,codon_list)

In [57]: sequence[mn:mx]
Out[57]: 'ATGCTTGA'

In [58]: od
Out[58]: {'ATG': [('ATG', 2)], 'TAA': [], 'TAG': [], 'TGA': [('TGA', 7)]}

In [59]: sequence = """GAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCT"""

In [60]: mn,mx,od = find_seq(sequence,codon_list)

In [61]: sequence[mn:mx]
Out[61]: 'TGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATG'

In [62]: od
Out[62]: 
{'ATG': [('ATG', 66), ('ATG', 130), ('ATG', 209)],
 'TAA': [('TAA', 185)],
 'TAG': [('TAG', 45), ('TAG', 197)],
 'TGA': [('TGA', 27), ('TGA', 76), ('TGA', 121), ('TGA', 131), ('TGA', 143)]}

Python code to find coding DNA with start and stop codons

2 Answers2