Getting Python to read a text file in chunks of 3 (codons) and give me an output

Question

I have a text file containing 3 columns: the stop codon, the skipping context, and a sequence of 102 bases that comes immediately after the skipping context. It looks a bit like this:

TAG GTTAGCT CTCGTGGTCCTCAAGGACTCAGAAACCAGGCTCGAGGCCTATCCCAGCAAGTGCTGCTCTGCTCTGCCCACCCTGGGTTCTGCATTCCTATGGGTGACCC
TAG GTTAGCT CTTATTCCCAGTGCCAGCTTTCTCTCCTCACATCCTCATAATGGATGCTGACTGTGTTGGGGGACAGAAGGGACTTGGCAGAGCTTTGCTCATGCCACTC
TAG GTTAGCT CTATTGTGTAACTGAGCAATTCTTTTCACTCTTGTGACTATCTCAGTCCTCTGCTGTTTTGTAACTGGTTTACCTCTATAGTTTATTTATTTTTAAATTA

etc...

I want to know how I can write a program that reads the 3rd column of this text file (i.e. the 102-base sequence) in chunks of three and picks out any stop codons ('TAG', 'TGA' or 'TAA'). It should create a list, table or something similar that tells me whether each sequence contains any of these stop codons and, if so, how many.

So far I have done this to get Python to read only the 3rd column of that text file:

inFile = open('test stop codon plus 102.txt', 'r')
outFile = open('TAG plus 102 reading inframe.txt', 'w')


for line in inFile:
    parts = line.split('\t')
    stopcodon = parts[0]
    skippingcontext = parts[1]
    plus102 = parts[2]

But I'm not sure where to go next.

Thanks in advance!

depending on what information you need at the end, but if you only need to know if the stop codon is there, a match with a regex of the form `^(.{3})*((TAG)|(TGA)|(TAA))` should work — njzk2, Nov 19 '14 at 14:19
Does 'chunks of threes' mean *anywhere* or only at positions that divide by three (0, 3, 6, 9, ...)? — LutzHorn, Nov 19 '14 at 14:22
@LutzHorn yes it means positions that divide by 3, I want it to start from the beginning and read the sequence as 'CTC', 'GTG', 'GTC' etc, then tell me if it finds any stop codon in the sequence and then how many of those stop codons it finds, if any. — lc336, Nov 19 '14 at 14:26
I have tested it and it seems to work but then it returns an error: 'stopcodon, skippingcontext, plus102 = line.split() ValueError: need more than 0 values to unpack' — lc336, Nov 19 '14 at 14:41
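
For reference, the regex idea from the first comment could look roughly like the sketch below. It assumes the sequence is already in a string; the helper name `first_in_frame_stop` is made up for illustration, and the pattern is a slight variation on the one in the comment (non-capturing groups plus a lazy quantifier so the first in-frame stop codon is the one reported).

```python
import re

# Skip whole triplets from the start, then require a stop codon.
# This only matches stop codons at positions 0, 3, 6, ...
IN_FRAME_STOP = re.compile(r'^(?:.{3})*?(TAG|TGA|TAA)')

def first_in_frame_stop(seq):
    """Return (position, codon) of the first in-frame stop codon, or None."""
    m = IN_FRAME_STOP.match(seq)
    if m is None:
        return None
    return m.start(1), m.group(1)

print(first_in_frame_stop("CTCTAGGTG"))  # codons are CTC, TAG, GTG
print(first_in_frame_stop("CTAGGTGCC"))  # TAG is present, but out of frame
```

This only tells you whether (and where) the first in-frame stop codon occurs. To count all of them, splitting the sequence into codons first is probably simpler.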

score 1 · Answer 1 · answered Nov 19 '14 at 13:59

I am not sure if I understand your question but you can try this.

Python:

input = """TAG GTTAGCT CTCGTGGTCCTCAAGGACTCAGAAACCAGGCTCGAGGCCTATCCCAGCAAGTGCTGCTCTGCTCTGCCCACCCTGGGTTCTGCATTCCTATGGGTGACCC
TAG GTTAGCT CTTATTCCCAGTGCCAGCTTTCTCTCCTCACATCCTCATAATGGATGCTGACTGTGTTGGGGGACAGAAGGGACTTGGCAGAGCTTTGCTCATGCCACTC
TAG GTTAGCT CTATTGTGTAACTGAGCAATTCTTTTCACTCTTGTGACTATCTCAGTCCTCTGCTGTTTTGTAACTGGTTTACCTCTATAGTTTATTTATTTTTAAATTA"""

for line in input.splitlines():
    if not line.strip():
        continue  # a blank line would make the unpacking below fail
    print(line)
    stopcodon, skippingcontext, plus102 = line.split()
    # in-frame codons only: start at 0 and step by 3
    words = [plus102[s:s+3] for s in range(0, len(plus102), 3)]
    for stopword in ["TAG", "TGA", "TAA"]:
        c = words.count(stopword)
        print("{} {}".format(stopword, c))

Output (counting in-frame codons only):

TAG GTTAGCT CTCGTGGTCCTCAAGGACTCAGAAACCAGGCTCGAGGCCTATCCCAGCAAGTGCTGCTCTGCTCTGCCCACCCTGGGTTCTGCATTCCTATGGGTGACCC
TAG 0
TGA 0
TAA 0
TAG GTTAGCT CTTATTCCCAGTGCCAGCTTTCTCTCCTCACATCCTCATAATGGATGCTGACTGTGTTGGGGGACAGAAGGGACTTGGCAGAGCTTTGCTCATGCCACTC
TAG 0
TGA 1
TAA 0
TAG GTTAGCT CTATTGTGTAACTGAGCAATTCTTTTCACTCTTGTGACTATCTCAGTCCTCTGCTGTTTTGTAACTGGTTTACCTCTATAGTTTATTTATTTTTAAATTA
TAG 1
TGA 1
TAA 1
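
The `ValueError: need more than 0 values to unpack` mentioned in the comments most likely comes from a blank line (for example at the end of the file), where `line.split()` returns an empty list. Here is a minimal sketch of one way around it, assuming whitespace-separated columns; the function name `count_in_frame_stops` and the short sample input are made up for illustration.

```python
from collections import Counter

STOP_CODONS = ("TAG", "TGA", "TAA")

def count_in_frame_stops(text):
    """Yield (sequence, counts) for each line that has exactly three columns."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines instead of failing
        seq = fields[2]
        # Tally every in-frame codon (positions 0, 3, 6, ...)
        codons = Counter(seq[i:i+3] for i in range(0, len(seq), 3))
        yield seq, {stop: codons[stop] for stop in STOP_CODONS}

# Made-up short input with a blank line in the middle
sample = "TAG GTTAGCT CTCTAGTGA\n\nTAG GTTAGCT ATAGCCTAA\n"
for seq, counts in count_in_frame_stops(sample):
    print(seq, counts)
```

With a real file you could pass `inFile.read()` as `text`, or adapt the loop to iterate over the file object directly.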

score 0 · Answer 2 · edited May 23 '17 at 12:20

You already have the plus102 part, okay. Are you sure about "I need it to read in chunks of threes"? Then, that is your question, and this question has already been answered on SO:

answered Nov 19 '14 at 13:55 · Dr. Jan-Philip Gehrcke

I looked at these questions before I wrote mine and they didn't provide the right answer. The answers above both helped, though. – lc336 Nov 19 '14 at 14:10

JulienD · Accepted Answer · 2014-11-19T18:58:21.533

To read the 102nt sequence 3 by 3:

by3 = [plus102[i:i+3] for i in range(0,len(plus102),3)]

To find the position (in the sequence) of stop codons in it:

stops = [(3*i,x) for i,x in enumerate(by3) if x in ["TAG","TGA","TAA"]]

Do you need to consider the phase also?

To write to file:

with open("outfile.txt", "w") as g:
    for (i, x) in stops:
        g.write("Stop codon " + x + " found at position " + str(i) + "\n")

You may consider string formatting, a tab-delimited output (see join), etc.
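
As a rough sketch of the tab-delimited idea with `join`, something like the following would give one row of counts per sequence, which is close to the 'TAG 3 found, TAA 0 found' output asked for in the comments. The function name `write_counts`, the header row and the two short sequences are assumptions for illustration only. An in-memory buffer stands in for the output file so the example runs without touching the disk.

```python
import io

STOP_CODONS = ["TAG", "TGA", "TAA"]

def write_counts(sequences, out):
    """Write one tab-delimited row of in-frame stop codon counts per sequence."""
    out.write("\t".join(["sequence"] + STOP_CODONS) + "\n")
    for seq in sequences:
        by3 = [seq[i:i+3] for i in range(0, len(seq), 3)]
        counts = [str(by3.count(stop)) for stop in STOP_CODONS]
        out.write("\t".join([seq] + counts) + "\n")

# io.StringIO stands in for open("outfile.txt", "w") here
buffer = io.StringIO()
write_counts(["CTCTAGTGA", "ATAGCCTAA"], buffer)
print(buffer.getvalue(), end="")
```

Any object with a `write` method works as `out`, so passing a file opened with `open("outfile.txt", "w")` should behave the same way.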

answered Nov 19 '14 at 13:55 · edited Nov 19 '14 at 18:58 · JulienD

Not sure what is meant by the phase, but your answer has very nearly helped, I just need to know now how to write this to a file so I would like something along the lines of 'TAG 3 found, TAA 0 found'.. etc in a separate file by doing 'outFile.write(.....)'. Would you happen to know how I could do this? Thanks. – lc336 Nov 19 '14 at 14:21
Updated. The phase is the shift from the start of the sequence (can be 0, 1 or 2). For instance if the sequence is AATCGACCA..., it can be split into codons [AAT,CGA,CCA], or [ATC,GAC,CA.] or [TCG,ACC,A..], depending on where the reading frame starts. – JulienD Nov 19 '14 at 19:00
Ah, thank you. Yes, I do need to consider that: it needs to start right from the beginning and then be split into threes from there, i.e. in your example in the previous comment, I'd need it to be split [AAT, CGA, CCA...]. Your updated code seems to work, so thank you very much. – lc336 Nov 19 '14 at 20:37
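
To illustrate the phase discussed in these comments, here is a small sketch that splits the same sequence in each of the three reading frames. The helper `codons` is a made-up name, and dropping the incomplete trailing codon is a choice made for this example (the comment above shows it padded with dots instead).

```python
def codons(seq, phase=0):
    """Split seq into complete codons, starting at offset phase (0, 1 or 2)."""
    # Stopping at len(seq) - 2 drops any incomplete codon at the end
    return [seq[i:i+3] for i in range(phase, len(seq) - 2, 3)]

seq = "AATCGACCA"
for phase in range(3):
    print(phase, codons(seq, phase))
```

Phase 0 is the frame the asker wants, so the default argument covers that case.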

score 0 · Answer 4 · answered Nov 19 '14 at 14:07

If you simply want to count the number of "TAG", "TGA", and "TAA" in plus102

import re
numberOfCodons = len(re.findall(r'TAG|TGA|TAA', plus102))

Note: This counts all non-overlapping matches of the pattern anywhere in the string (see the documentation for re.findall), not only those that fall on codon boundaries (positions 0, 3, 6, ...).
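
The difference between counting anywhere and counting in frame can be seen in a short sketch. The function names and the example sequence are made up for illustration, and using `re.findall(r'...', seq)` to cut the string into consecutive triplets is just one possible way to stay with regexes.

```python
import re

STOP = re.compile(r'TAG|TGA|TAA')

def count_anywhere(seq):
    """Count stop codons at any offset in the string."""
    return len(STOP.findall(seq))

def count_in_frame(seq):
    """Count stop codons only at offsets 0, 3, 6, ..."""
    triplets = re.findall(r'...', seq)  # consecutive non-overlapping triplets
    return sum(1 for codon in triplets if STOP.fullmatch(codon))

seq = "ATAGCCTAA"  # made-up example: TAG is out of frame, TAA is in frame
print(count_anywhere(seq), count_in_frame(seq))
```

For this sequence the two counts differ, because the TAG starts at index 1 and so does not line up with a codon boundary.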
