0

I have a text file with three columns: the stop codon, the skipping context, and the 102 bases that come immediately after the skipping context. It looks a bit like this:

TAG GTTAGCT CTCGTGGTCCTCAAGGACTCAGAAACCAGGCTCGAGGCCTATCCCAGCAAGTGCTGCTCTGCTCTGCCCACCCTGGGTTCTGCATTCCTATGGGTGACCC
TAG GTTAGCT CTTATTCCCAGTGCCAGCTTTCTCTCCTCACATCCTCATAATGGATGCTGACTGTGTTGGGGGACAGAAGGGACTTGGCAGAGCTTTGCTCATGCCACTC
TAG GTTAGCT CTATTGTGTAACTGAGCAATTCTTTTCACTCTTGTGACTATCTCAGTCCTCTGCTGTTTTGTAACTGGTTTACCTCTATAGTTTATTTATTTTTAAATTA

etc...

I want to write a program that reads the 3rd column of this text file (i.e. the 102-base sequence) in chunks of three and picks out any stop codons ('TAG', 'TGA' or 'TAA'). It should then create a list, table or something similar that tells me whether each sequence contains any of these stop codons and, if so, how many.

So far I have done this to get Python to read only the 3rd column of that text file:

inFile = open('test stop codon plus 102.txt', 'r')
outFile = open('TAG plus 102 reading inframe.txt', 'w')


for line in inFile:
    parts = line.split('\t')
    stopcodon = parts[0]
    skippingcontext = parts[1]
    plus102 = parts[2]

But I'm not sure where to go next.

Thanks in advance!

Cory Kramer
lc336
  • depending on what information you need at the end, but if you only need to know if the stop codon is there, a match with a regex of the form `^(.{3})*((TAG)|(TGA)|(TAA))` should work – njzk2 Nov 19 '14 at 14:19
  • Does 'chunks of threes' mean *anywhere* or only at positions that divide by three (0, 3, 6, 9, ...)? –  Nov 19 '14 at 14:22
  • @LutzHorn yes it means positions that divide by 3, I want it to start from the beginning and read the sequence as 'CTC', 'GTG', 'GTC' etc, then tell me if it finds any stop codon in the sequence and then how many of those stop codons it finds, if any. – lc336 Nov 19 '14 at 14:26
  • Then please check if my answer below is correct. –  Nov 19 '14 at 14:31
  • I have tested it and it seems to work but then it returns an error: 'stopcodon, skippingcontext, plus102 = line.split() ValueError: need more than 0 values to unpack' – lc336 Nov 19 '14 at 14:41

4 Answers

1

I am not sure if I understand your question but you can try this.

Python:

input = """TAG GTTAGCT CTCGTGGTCCTCAAGGACTCAGAAACCAGGCTCGAGGCCTATCCCAGCAAGTGCTGCTCTGCTCTGCCCACCCTGGGTTCTGCATTCCTATGGGTGACCC
TAG GTTAGCT CTTATTCCCAGTGCCAGCTTTCTCTCCTCACATCCTCATAATGGATGCTGACTGTGTTGGGGGACAGAAGGGACTTGGCAGAGCTTTGCTCATGCCACTC
TAG GTTAGCT CTATTGTGTAACTGAGCAATTCTTTTCACTCTTGTGACTATCTCAGTCCTCTGCTGTTTTGTAACTGGTTTACCTCTATAGTTTATTTATTTTTAAATTA"""

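for line in input.split("\n"):
    if not line.strip():
        continue  # skip blank lines, which would otherwise break the unpacking below
    print(line)
    stopcodon, skippingcontext, plus102 = line.split()
    # Slide a 3-base window over plus102 one base at a time (every offset, not only in-frame).
    words = [plus102[s:s+3] for s in range(len(plus102) - 2)]
    for stopword in ["TAG", "TGA", "TAA"]:
        c = words.count(stopword)
        print("{} {}".format(stopword, c))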

Output:

TAG GTTAGCT CTCGTGGTCCTCAAGGACTCAGAAACCAGGCTCGAGGCCTATCCCAGCAAGTGCTGCTCTGCTCTGCCCACCCTGGGTTCTGCATTCCTATGGGTGACCC
TAG 0
TGA 1
TAA 0
TAG GTTAGCT CTTATTCCCAGTGCCAGCTTTCTCTCCTCACATCCTCATAATGGATGCTGACTGTGTTGGGGGACAGAAGGGACTTGGCAGAGCTTTGCTCATGCCACTC
TAG 0
TGA 1
TAA 1
TAG GTTAGCT CTATTGTGTAACTGAGCAATTCTTTTCACTCTTGTGACTATCTCAGTCCTCTGCTGTTTTGTAACTGGTTTACCTCTATAGTTTATTTATTTTTAAATTA
TAG 1
TGA 2
TAA 3
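Note that the list comprehension above moves one base at a time, so it counts stop codons in every reading frame. If only in-frame codons (positions 0, 3, 6, ...) should count, as clarified in the comments on the question, a minimal sketch could step by three instead. The helper name `count_inframe_stops` and the short test sequence are made up for illustration.

```python
from collections import Counter

STOP_CODONS = ("TAG", "TGA", "TAA")

def count_inframe_stops(seq):
    """Count stop codons at positions 0, 3, 6, ... of seq (hypothetical helper)."""
    # Step by 3 so only codons in the first reading frame are considered.
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    counts = Counter(c for c in codons if c in STOP_CODONS)
    # Report every stop codon, including those that were not found.
    return {stop: counts[stop] for stop in STOP_CODONS}

# Made-up sequence: CTC TAG GTG TGA
print(count_inframe_stops("CTCTAGGTGTGA"))
```

A stop codon that only appears at a shifted offset (for example the TAG in "CTAGGG") is not counted by this version.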
0

You already have the plus102 part, okay. Are you sure about "I need it to read in chunks of threes"? Then, that is your question, and this question has already been answered on SO:

Community
Dr. Jan-Philip Gehrcke
  • I looked at these questions before I wrote mine and they didn't provide the right answer. The answers above both helped, though. – lc336 Nov 19 '14 at 14:10
0

To read the 102nt sequence 3 by 3:

by3 = [plus102[i:i+3] for i in range(0,len(plus102),3)]

To find the position (in the sequence) of stop codons in it:

stops = [(3*i,x) for i,x in enumerate(by3) if x in ["TAG","TGA","TAA"]]

Do you need to consider the phase also?

To write to file:

with open("outfile.txt", "w") as g:
    for (i, x) in stops:
        g.write("Stop codon " + x + " found at position " + str(i) + "\n")

You may also want to consider string formatting, tab-delimited output (see str.join), etc.
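A per-sequence summary like the one asked for in the comments ('TAG 3 found, TAA 0 found') could be written as tab-delimited lines roughly as follows. This is only a sketch: the helper name `format_counts`, the column order and the sample sequences are assumptions, and `io.StringIO` stands in for the real output file.

```python
import io

STOP_CODONS = ["TAG", "TGA", "TAA"]

def format_counts(seq):
    """Return one tab-delimited line of in-frame stop codon counts (hypothetical helper)."""
    by3 = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    counts = [str(by3.count(stop)) for stop in STOP_CODONS]
    return "\t".join([seq] + counts)

# io.StringIO stands in for open("outfile.txt", "w") so the sketch is self-contained.
out = io.StringIO()
out.write("\t".join(["sequence"] + STOP_CODONS) + "\n")  # header row
for seq in ["CTCTAGGTGTGA", "AAATAATAACCC"]:  # made-up sample sequences
    out.write(format_counts(seq) + "\n")

print(out.getvalue(), end="")
```

With a real file, the same `write` calls would go inside a `with open(...) as g:` block, one call per input line.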

JulienD
  • Not sure what is meant by the phase, but your answer has very nearly helped, I just need to know now how to write this to a file so I would like something along the lines of 'TAG 3 found, TAA 0 found'.. etc in a separate file by doing 'outFile.write(.....)'. Would you happen to know how I could do this? Thanks. – lc336 Nov 19 '14 at 14:21
  • Updated. The phase is the shift from the start of the sequence (can be 0, 1 or 2). For instance if the sequence is AATCGACCA..., it can be split into codons [AAT,CGA,CCA], or [ATC,GAC,CA.] or [TCG,ACC,A..], depending on where the reading frame starts. – JulienD Nov 19 '14 at 19:00
  • Ah thank you, yes I do need to consider that, it needs to start right from the beginning and then be split into three from there i.e. in your example in the previous comment, I'd need it to be split [AAT, CGA, CCA...]. Your updated code seems to work so thank you very much. – lc336 Nov 19 '14 at 20:37
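The three phases described in the comment above can be illustrated with a small sketch. The function name `split_codons` and its `phase` parameter are made up for this example; phase 0 is the frame the asker wants.

```python
def split_codons(seq, phase=0):
    """Split seq into consecutive triplets starting at offset phase (0, 1 or 2)."""
    # The last chunk may be shorter than 3 bases if the sequence runs out.
    return [seq[i:i + 3] for i in range(phase, len(seq), 3)]

# The example sequence from the comment, read in each of the three frames.
for phase in range(3):
    print(phase, split_codons("AATCGACCA", phase))
```

Because a trailing chunk shorter than three bases can never equal a stop codon, it does not need special handling when counting.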
0

If you simply want to count the number of "TAG", "TGA", and "TAA" in plus102:

import re
numberOfCodons = len(re.findall(r'TAG|TGA|TAA', plus102))

Note: re.findall returns all non-overlapping matches of the pattern in the string, as a list of strings (see the documentation for re.findall). It matches at any position, not just in chunks of three.
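To make the difference concrete, the following sketch compares the position-independent `re.findall` count with an in-frame count on a made-up sequence. The in-frame variant is an assumption about what the asker needs (it is not part of the one-liner above): it splits the sequence into triplets first and then tests each triplet.

```python
import re

STOP_PATTERN = re.compile(r"TAG|TGA|TAA")

seq = "CTAGGGTGA"  # made-up example: TAG is out of frame, TGA is in frame

# Non-overlapping matches anywhere in the sequence.
anywhere = len(STOP_PATTERN.findall(seq))

# Only triplets starting at positions 0, 3, 6, ...
codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
in_frame = sum(1 for codon in codons if STOP_PATTERN.fullmatch(codon))

print(anywhere, in_frame)
```

Here the sequence splits into CTA, GGG and TGA, so only one stop codon is in frame even though the pattern matches twice overall.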

mrlnt