Bug in my code: identifying sequence within another sequence

Question

My current code:

import re
from Bio.Seq import Seq


def check_promoter(binding_element,promoter_seq):
        promoter_seq = str(promoter_seq)
        residues = list()
        for i in range(0,len(promoter_seq)):
            if binding_element[0] == promoter_seq[i]:
                ind = promoter_seq[i]
                for j in range(0,len(binding_element)):
                    if binding_element[0+j] == promoter_seq[i+j-len(binding_element)]:
                        residues.append(i+j-len(binding_element))
        return residues 


ESR1_promoter = Seq('''aagtcaggctgagagaatctcagaaggttgtggaagggtctatctacttt\
gggagcattttgcagaggaagaaactgaggtcctggcaggttgcattctc\
ctgatggcaaaatgcagctcttcctatatgtataccctgaatctccgccc\
ccttcccctcagatgccccctgtcagttcccccagctgctaaatatagct\
gtctgtggctggctgcgtatgcaaccgcacaccccattctatctgcccta\
tctcggttacagtgtagtcctccccagggtcatcctatgtacacactacg\
tatttctagccaacgaggagggggaatcaaacagaaagagagacaaacag\
agatatatcggagtctggcacggggcacataaggcagcacattagagaaa\
gccggcccctggatccgtctttcgcgtttattttaagcccagtcttccct\
gggccacctttagcagatcctcgtgcgcccccgccccctggccgtgaaac\
tcagcctctatccagcagcgacgacaagtaaagtaaagttcagggaagct\
gctctttgggatcgctccaaatcgagttgtgcctggagtgatgtttaagc\
caatgtcagggcaaggcaacagtccctggccgtcctccagcacctttgta\
atgcatatgagctcgggagaccagtacttaaagttggaggcccgggagcc\
caggagctggcggagggcgttcgtcctgggactgcacttgctcccgtcgg\
gtcgcccggcttcaccggacccgcaggctcccggggcagggccggggcca\
gagctcgcgtgtcggcgggacatgcgctgcgtcgcctctaacctcgggct\
gtgctctttttccaggtggcccgccggtttctgagccttctgccctgcgg\
ggacacggtctgcaccctgcccgcggccacggaccatgaccatgaccctc\
cacaccaaagcatctgggatggccctactgcatcagatccaagggaacga''')
ESR1_complement = ESR1_promoter.complement()

SBE = 'CAGACA'

print(check_promoter(SBE,ESR1_promoter))
print(check_promoter(SBE,ESR1_complement))

This code works when I test it with the string 'aa': it returns a list of the indices where 'aa' was found. But when I test with other sequences (e.g. 'tcc'), it finds no matches, even though 'tcc' clearly occurs in the sequence. Further, the re.findall method identified the string 'CAGACA' in the complement string, but it does not provide an index.

Can anybody suggest what I'm doing wrong?

Also, a secondary problem: as you can see, I have cheated a little, since my code only checks the elements at

promoter_seq[i+j-len(binding_element)]

because otherwise I get an index error. Does anybody know a way around this?

Thanks

You use the `Seq` class from the `Bio` module. It seems to define the `[]` operator, but it might not return what you expect. Without confirming what it returns, we can't help you. Additionally, you're searching for "cagaca", which isn't even in that sequence, whereas "aa" is. As far as I can see, this function works perfectly, unless of course `Bio.Seq` transforms its input string into something else; but to answer that I'd have to know the package, and I don't. — ljetibo, Mar 06 '15 at 12:39
Hi there. You're right, I only used the Bio.Seq module to get the complement string (in biology A always complements T and G always complements C), but I forgot to change it back. Either way, I don't think this changes the behavior, as I get the same result. Also, the string 'CAGACA' was identified in the complement strand using the re.findall method — CiaranWelsh, Mar 06 '15 at 12:54
How about we start from scratch? So, you've got this long string that has no "cagaca" in it. You created a complement of the sequence. You found a hit for "cagaca" in that complement using `re.findall` but can't figure out why your function doesn't? If that is your question, it's because you send in the original string with `print check_promoter(SBE,ESR1_promoter)` and not `ESR1_complement`. If it's not, sorry. See [this](http://stackoverflow.com/questions/4664850/find-all-occurrences-of-a-substring-in-python) on how to use regex to the same effect. — ljetibo, Mar 06 '15 at 13:08

score 3 · Accepted Answer · answered Mar 06 '15 at 13:37

I'm surprised that there is no preexisting function in Bio to do this type of search - it would seem a very common operation. Perhaps you need to spend some time with the documentation.

Anyway, you could just use re.finditer(), which returns an iterator that yields match objects:

import re
from Bio.Seq import Seq

def check_promoter(binding_element, promoter_seq):
    return [m.start() for m in
               re.finditer(str(binding_element).lower(),
                           str(promoter_seq).lower())]

ESR1_promoter = Seq('aagtcaggctgagagaatctcagaaggttgtggaagggtctatctactttgggagcattttgcagaggaagaaactgaggtcctggcaggttgcattctcctgatggcaaaatgcagctcttcctatatgtataccctgaatctccgcccccttcccctcagatgccccctgtcagttcccccagctgctaaatatagctgtctgtggctggctgcgtatgcaaccgcacaccccattctatctgccctatctcggttacagtgtagtcctccccagggtcatcctatgtacacactacgtatttctagccaacgaggagggggaatcaaacagaaagagagacaaacagagatatatcggagtctggcacggggcacataaggcagcacattagagaaagccggcccctggatccgtctttcgcgtttattttaagcccagtcttccctgggccacctttagcagatcctcgtgcgcccccgccccctggccgtgaaactcagcctctatccagcagcgacgacaagtaaagtaaagttcagggaagctgctctttgggatcgctccaaatcgagttgtgcctggagtgatgtttaagccaatgtcagggcaaggcaacagtccctggccgtcctccagcacctttgtaatgcatatgagctcgggagaccagtacttaaagttggaggcccgggagcccaggagctggcggagggcgttcgtcctgggactgcacttgctcccgtcgggtcgcccggcttcaccggacccgcaggctcccggggcagggccggggccagagctcgcgtgtcggcgggacatgcgctgcgtcgcctctaacctcgggctgtgctctttttccaggtggcccgccggtttctgagccttctgccctgcggggacacggtctgcaccctgcccgcggccacggaccatgaccatgaccctccacaccaaagcatctgggatggccctactgcatcagatccaagggaacga')
ESR1_complement = ESR1_promoter.complement()

SBE = 'CAGACA'

>>> check_promoter(SBE, ESR1_promoter)
[]
>>> check_promoter(SBE, ESR1_complement)
[200]
>>> check_promoter('tcc', ESR1_promoter)
[80, 98, 121, 143, 153, 177, 267, 270, 282, 413, 445, 467, 510, 565, 622, 632, 635, 723, 741, 778, 860, 948, 987]
>>> check_promoter('TCC', ESR1_promoter)
[80, 98, 121, 143, 153, 177, 267, 270, 282, 413, 445, 467, 510, 565, 622, 632, 635, 723, 741, 778, 860, 948, 987]

>>> check_promoter(Seq('CAGACA'), ESR1_complement)
[200]

Note that binding_element can be a Seq or a string. Because the regex search is case sensitive, both binding_element and promoter_seq are converted to lower case before searching.
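If you would rather keep a hand-written scan than use a regex, the following is a minimal sketch of one way to do it. It is not the code from the question: the name `find_motif` is made up for illustration, and it assumes you want every start index, including overlapping hits. It compares a whole slice at each position instead of matching character by character, and it only tries start positions that leave room for the full motif, which is what removes the need for the `i+j-len(binding_element)` workaround.

```python
def find_motif(binding_element, promoter_seq):
    """Return every start index at which binding_element occurs in promoter_seq.

    Hypothetical helper for illustration; both arguments may be plain
    strings or Bio.Seq.Seq objects, since str() is applied to each.
    """
    motif = str(binding_element).lower()
    target = str(promoter_seq).lower()
    if not motif:
        return []
    # Only try start positions that leave room for the whole motif, so no
    # index ever runs past the end of the target sequence.
    last_start = len(target) - len(motif)
    return [i for i in range(last_start + 1)
            if target[i:i + len(motif)] == motif]


print(find_motif('tcc', 'aagtccTCCa'))  # [3, 6]
print(find_motif('aa', 'aaaa'))         # [0, 1, 2]
```

One difference from the re.finditer() version: re.finditer() reports non-overlapping matches only, so for 'aa' in 'aaaa' it would give [0, 2], while this scan also reports the overlapping hit at index 1. Which behavior you want depends on how you count overlapping binding sites.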
