How to find three-letter substrings in a sequence?

Question

I have the following sequence:

my_file_m= "TCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTA
TCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAA
GATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGG
AGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT"

I would like to find where the three-letter strings TAA, TGA and TAG occur and how many times each one occurs. If there are any, I would like to highlight them in color.

I started by loading the sequence:

my_file = open(my_file_m)
mine = my_file.read()
print(mine)

I could not use .count, nor could I use find, because I have three inputs. Is there a way to find them and highlight them?

"_[I couldn't] use find because I have three inputs_". Just call `find` three times, once per input. — Kevin, Feb 12 '15 at 15:32
Split the string three times using a regex, once for each of those words. The count for each word is the length of the resulting list minus 1. — ha9u63a7, Feb 12 '15 at 15:37
@akrun I felt stupid: within a few seconds I received over 8 comments! I thought the question was so stupid that nobody had ever faced it. But if you think it is an OK question, I can ask again — , Feb 17 '15 at 12:27
I've had this one as an interview question to solve on a whiteboard — lxx, May 18 '15 at 05:19
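
Kevin's suggestion of calling `find` once per input could look roughly like the sketch below. The helper name `find_all` and the short sample string are made up for illustration and are not from the question. `str.find` returns -1 when there is no further match, which ends the loop.

```python
def find_all(sequence, pattern):
    """Return every start index of pattern in sequence (overlaps included)."""
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        # resume one character later so overlapping hits are not skipped
        start = sequence.find(pattern, start + 1)
    return positions


sample = "CTGACCTAACTTAGATGA"  # hypothetical short sequence, not the one from the question
hits = {word: find_all(sample, word) for word in ("TAA", "TGA", "TAG")}
print(hits)  # {'TAA': [6], 'TGA': [1, 15], 'TAG': [11]}
```

The length of each list is the count for that word, so one pass per word gives both the positions and the totals.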

styvane · Answer 1 · 2015-02-12T16:03:41.143

4

Using the re.findall function and collections.Counter from the standard library:

import re
from collections import Counter

pat = re.compile(r"(TAA|TGA|TAG)")
c = re.findall(pat,my_file_m)

print(c)
print(Counter(c))

Output

['TGA', 'TGA', 'TAA', 'TAG', 'TGA', 'TGA', 'TGA', 'TAA']
Counter({'TGA': 5, 'TAA': 2, 'TAG': 1})
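
One thing to watch for: the sequence in the question is spread over several lines, and a pattern that straddles a line break is not matched. The sketch below, on a made-up two-line sample, strips the whitespace first and uses `finditer` so that the positions are available as well as the counts. The variable names are illustrative only.

```python
import re
from collections import Counter

# hypothetical two-line sequence; "TGA" straddles the line break
raw = "CCTAACCT\nGACCTAG"
sequence = "".join(raw.split())  # remove newlines and any other whitespace

pattern = re.compile(r"TAA|TGA|TAG")
matches = [(m.start(), m.group()) for m in pattern.finditer(sequence)]
counts = Counter(word for _, word in matches)

print(matches)               # [(2, 'TAA'), (7, 'TGA'), (12, 'TAG')]
print(counts)
print(pattern.findall(raw))  # ['TAA', 'TAG'], the split "TGA" is missed
```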

edited Feb 12 '15 at 16:03

answered Feb 12 '15 at 15:43


score 4 · Accepted Answer · edited May 23 '17 at 12:15

Here is my solution to your question:

Note: This code also finds overlapping matches, because the pattern is wrapped in a lookahead, (?=...). If you do not want overlapping matches, remove the '?='.

import re 

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

my_file_m= '''TTCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTATCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAAGATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGGAGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT'''


pat = re.compile(r'(?=(TAA|AAT|TGA|TAG))')  # Very important: the lookahead finds overlapping matches; remove '?=' if you do not need overlaps
# AAT is not one of the three strings asked about; drop it to match only TAA, TGA and TAG
matches = pat.finditer(my_file_m)
result1 = [match.start(1) for match in matches]  # starting position of every match
result2 = [range(x, x + 3) for x in result1]  # positions of all characters in each match (patterns have length 3; adjust for other lengths)
result3 = set().union(*result2)  # union of all positions to color

for chari, letter in enumerate(my_file_m):  # colorize depending on whether the position is inside a match
    if chari in result3:
        print(bcolors.OKGREEN + letter + bcolors.ENDC, end=' ')
    else:
        print(letter, end=' ')

Cleaner:

import re 
import sys

my_file_m= '''TAATTCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTATCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAAGATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGGAGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT'''

pat = re.compile(r'(?=(TAA|TGA|TAG))') # Very important, if you do not need overlaps then remove '?='
lettersToColor = set().union(*[range(m.start(1),m.start(1)+3) for m in re.finditer(pat, my_file_m)])

for chari in range(len(my_file_m)): # colorize based on if it is in a sequence or not
    if(chari in lettersToColor):
        sys.stdout.write('\033[92m' + my_file_m[chari]  +'\033[0m')
    else:
        sys.stdout.write(my_file_m[chari])

Credit to: here and here

Output: [screenshot of the colorized sequence; image not available]
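
As a side note on overlaps: TAA, TGA and TAG all start with T and contain no other T, so two of them can never share a letter, and the lookahead makes no difference for those three alone. It matters once the patterns can overlap, as TAA and AAT do. A minimal sketch on a made-up sample string:

```python
import re

sample = "CTAATGC"  # hypothetical: "TAA" at index 1 overlaps "AAT" at index 2

with_lookahead = re.compile(r"(?=(TAA|AAT))")  # zero-width match, so the scan advances one letter at a time
without_lookahead = re.compile(r"(TAA|AAT)")   # consumes the match, so the scan resumes after it

overlapping = [(m.start(1), m.group(1)) for m in with_lookahead.finditer(sample)]
non_overlapping = [(m.start(1), m.group(1)) for m in without_lookahead.finditer(sample)]

print(overlapping)      # [(1, 'TAA'), (2, 'AAT')]
print(non_overlapping)  # [(1, 'TAA')]
```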

Aaron · Answer 3 · 2015-02-12T16:05:17.693

0

Do you need to split the DNA sequence into consecutive groups of three letters to map the genetic code?

If so, see the following code.

my_file_m= '''TCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTA
TCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAA
GATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGG
AGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT'''

mm = "".join(my_file_m.split())                 # delete the newline characters

messenger = [''.join(triplet) for triplet in zip(*[iter(mm)] * 3)]    # split into groups of three letters (a list, so .count works)

print(messenger.count('TAA'))
print(messenger.count('TGA'))
print(messenger.count('TAG'))

Output

0
1
0
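
The `zip(*[iter(mm)]*3)` idiom pulls three letters at a time from one shared iterator, so it only reads the groups that start at index 0 and silently drops any leftover letters at the end. If the other two offsets matter, a slicing-based version might look like the sketch below. The helper name `triplets` and the sample string are assumptions made for illustration.

```python
def triplets(sequence, frame=0):
    """Split sequence into consecutive groups of three, starting at offset `frame`."""
    # ignore trailing letters that do not fill a complete group
    usable_end = len(sequence) - (len(sequence) - frame) % 3
    return [sequence[i:i + 3] for i in range(frame, usable_end, 3)]


sample = "ATGTAACTGAC"  # hypothetical sequence
for frame in range(3):
    print(frame, triplets(sample, frame))
# 0 ['ATG', 'TAA', 'CTG']
# 1 ['TGT', 'AAC', 'TGA']
# 2 ['GTA', 'ACT', 'GAC']
```

With `frame=0` this gives the same groups as the `zip` idiom. Which of the two is right depends on whether the counts should respect a fixed frame or consider every position, as the regex answers do.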

edited Feb 12 '15 at 16:05

answered Feb 12 '15 at 15:53


Thanks @Aaron, those are some very interesting commands, and I like the answer already ;-) However, why did you use .split? Were you trying to remove the ''' characters? – Feb 12 '15 at 16:10
In fact, there is a `\n` character at the end of the first three lines. `\n` means new line. I use `split` to remove the new line characters. – Aaron Feb 12 '15 at 16:13

If you want to find the locations of 'TAA', you can `import re` and evaluate `[m.start(0) for m in re.finditer("TAA", my_file_m)]`. It returns the index of every occurrence of 'TAA'. – Aaron Feb 12 '15 at 16:25
