Why is re.findall not being specific in finding triplet items in string. Python

Question

So I have a few lines of code:

import re

seq = 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'

OR_0 = re.findall(r'ATG(?:...){9,}?(?:TAA|TAG|TGA)', seq)

Let me explain what I am attempting to do first. I'm sorry if this is confusing, but I am going to try my best to explain it.

So I'm looking for sequences that START with 'ATG', followed by units of 3 of any word char [e.g. 'GGG', 'GTT', 'TTA', etc.], until it encounters either a 'TAA', 'TAG' or 'TGA'. I also want them to be at least 30 characters long, hence the {9,}?

This works to some degree but if you notice in seq that there is ATG GAA GTT GGA TGA AAG TGG AGG TAA AGA GAA GAC GTT TGA

So in this case, it should be finding 'ATGGAAGTTGGATGA' if it starts with the first 'ATG' and goes until the next 'TAA','TAG' or 'TGA'

HOWEVER, when you run the OR_0 line of code, it spits back out the entire seq string. I don't know how to make it only consider the first 'TAA', 'TAG' or 'TGA' that follows the first 'ATG'.

If an 'ATG' is followed by another 'ATG' when read in units of 3 then that is alright, it should NOT start over but if it encounters a 'TAA','TAG' or 'TGA' when read in units of 3 it should stop.

My question: why is re.findall finding the longest sequence of 'ATG'xxx-xxx-['TAA', 'TAG' or 'TGA'] instead of the first occurrence of 'TAA', 'TAG' or 'TGA' after an ATG, separated by word characters in units of 3?

Once again, I apologize if this is confusing, but it's messing with multiple data sets that I have based on this initial line of code and I'm trying to find out why.

This would do for `OR_0` : `ATG[ATG]{3}(.*?)[ATG]` ? To match until the latest found would be `ATG[ATG]{3}(.*)[ATG]`, but if you search for the string multiple times, you will need a certain separator (like a comma or line break) to know where to end. With line breaks; *not using* `DOTALL` (`//s`) would suffice. — , Apr 28 '13 at 07:57
I'm not working in the gene patenting business, I'm just a biology student getting into bioinformatics @ealfonso — O.rka, Apr 28 '13 at 08:07
What should be the expected result for the input you showed? AFAIK the result you are getting is correct because any other match would be too short. — Bakuriu, Apr 28 '13 at 08:26
For your example, according to your description (and from some basic knowledge about DNA), your sample input should not return any result. The only valid sequence is `ATGGAAGTTGGATGA` (it cannot be longer, since it has been terminated by `TGA`), and the part in between only contains 3 codons (9 characters long). — nhahtdh, Apr 28 '13 at 11:34

score 2 · Accepted Answer · answered Apr 28 '13 at 08:44

If you want your regex to stop matching at the first TAA|TAG|TGA, but still only succeed if there are at least nine three letter chunks, the following may help:

>>> import re
>>> regexp = r'ATG(?:(?!TAA|TAG|TGA)...){9,}?(?:TAA|TAG|TGA)'
>>> re.findall(regexp, 'ATGAAAAAAAAAAAAAAAAAAAAAAAAAAATAG')
['ATGAAAAAAAAAAAAAAAAAAAAAAAAAAATAG']
>>> re.findall(regexp, 'ATGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAG')
['ATGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAG']
>>> re.findall(regexp, 'ATGAAATAGAAAAAAAAAAAAAAAAAAAAATAG')
[]

This uses a negative lookahead (?!TAA|TAG|TGA) to ensure that a three character chunk is not a TAA|TAG|TGA before it matches the three character chunk.
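As a rough sketch of how this plays out on the sequence from the question, the lookahead pattern can be wrapped in a small helper. The name `find_orfs` and its `min_codons` parameter are made up for illustration and are not part of the answer; the pattern is built with the plain greedy `{n,}` quantifier.

```python
import re

STOP = "TAA|TAG|TGA"

def find_orfs(seq, min_codons=9):
    """Hypothetical helper: ATG, then at least `min_codons` non-stop
    triplets, then the first in-frame stop codon."""
    # The doubled braces produce a literal {n,} quantifier in the f-string.
    pattern = rf"ATG(?:(?!{STOP})...){{{min_codons},}}(?:{STOP})"
    return re.findall(pattern, seq)

seq = 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'

print(find_orfs(seq))                 # []
print(find_orfs(seq, min_codons=0))   # ['ATGGAAGTTGGATGA']
```

With the default of nine codons the question's sequence yields nothing, because the first in-frame stop codon arrives after only three codons.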

Note though that a TAA|TAG|TGA that does not fall on a three character boundary will still successfully match:

>>> re.findall(regexp, 'ATGAAAATAGAAAAAAAAAAAAAAAAAAAATAG')
['ATGAAAATAGAAAAAAAAAAAAAAAAAAAATAG']

Actually, the `?` in `{9,}?` is not necessary. The negative look-ahead already disallows the ending sequences from being part of the repeated portion, so we can do a greedy match. Aside from that, good solution so +1 — nhahtdh, Apr 28 '13 at 09:46
@nhahtdh You're right, the lazy `?` in `{9,}?` is now redundant. I forgot to remove it when I copied the pattern from the question. — Tim Heap, Apr 29 '13 at 04:42
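A quick way to check the point made in these comments is to run the lazy and greedy forms of the lookahead pattern side by side. This is only a spot check on a handful of inputs, not a proof; the sample strings are made up for the illustration.

```python
import re

lazy = re.compile(r'ATG(?:(?!TAA|TAG|TGA)...){9,}?(?:TAA|TAG|TGA)')
greedy = re.compile(r'ATG(?:(?!TAA|TAG|TGA)...){9,}(?:TAA|TAG|TGA)')

samples = [
    'ATG' + 'AAA' * 9 + 'TAG',               # exactly nine codons
    'ATG' + 'AAA' * 12 + 'TGA' + 'CCCTAA',   # a later stop codon is ignored
    'ATG' + 'AAA' * 2 + 'TAG' + 'AAA' * 9,   # early in-frame stop: no match
    'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA',  # sequence from the question
]

for s in samples:
    # Both forms must end at the first in-frame stop codon, so they agree.
    print(lazy.findall(s) == greedy.findall(s), greedy.findall(s))
```

Because the lookahead stops the repeated group at the first in-frame stop codon, there is only one place the match can end, so greediness makes no difference to the result.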

score 1 · Answer 2 · answered Apr 28 '13 at 08:36

If the length is not a requirement then it's pretty easy:

>>> import re
>>> seq= 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'
>>> regex = re.compile(r'ATG(?:...)*?(?:TAA|TAG|TGA)')
>>> regex.findall(seq)
['ATGGAAGTTGGATGA']

Anyway I believe, according to your explanation, that your previous regex is actually doing what you want: searching for matches of at least 30 characters that start in ATG and end in TGA.

In your question you first state that you need matches of at least 30 characters, and hence you put the {9,}?, but after that you expect it to return a match of any length. You cannot have both; choose one. If length is important, then keep the regex you already have: the result you are getting is correct.
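To make the trade-off concrete, here is a minimal sketch that runs both patterns from this thread on the question's sequence. The variable names `with_min` and `no_min` are just labels for the illustration.

```python
import re

seq = 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'

# Pattern from the question: at least nine triplets before the stop codon.
with_min = re.findall(r'ATG(?:...){9,}?(?:TAA|TAG|TGA)', seq)

# Same pattern without the minimum: stops at the first in-frame stop codon.
no_min = re.findall(r'ATG(?:...)*?(?:TAA|TAG|TGA)', seq)

print(with_min)            # the whole 42-character sequence
print(no_min)              # ['ATGGAAGTTGGATGA']
print(len(no_min[0]))      # 15, well short of the 30-character requirement
```

The lazy quantifier only prefers the shortest match that still satisfies the minimum, so with `{9,}?` the regex is forced to read past the two earlier stop codons.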

Bakuriu


The regex in the question does not stop at the first occurrence of ending sequence (counting from the starting sequence ATG). – nhahtdh Apr 28 '13 at 09:49
@nhahtdh From what I understand of the question, that's not what the OP wants. As I already said in a comment, he should provide more examples of input/expected output, since three lines of examples would be worth a thousand lines of description in words... – Bakuriu Apr 28 '13 at 11:19
That's true - his example is not consistent with his description. (Anyway, it is best to have both example and description - example to quickly show what one wants to match, and description to cover the process). – nhahtdh Apr 28 '13 at 11:37
oh I see! ok yes that makes sense . . . it's looking for ATG followed by at least 9 sets of 3 and then TAA, TGA, or TAG. Those triplets in between can be a TAA, TGA, or TAG by the way the code is set up. I see the problem now – O.rka Apr 28 '13 at 16:49
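The point in that last comment can be checked by cutting the match from the question's regex into triplets and looking for stop codons before the final one. The helper name `codons` is an assumption made for this sketch.

```python
import re

STOPS = {"TAA", "TAG", "TGA"}

def codons(s):
    """Split a string into consecutive three-character chunks."""
    return [s[i:i + 3] for i in range(0, len(s), 3)]

seq = 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'
match = re.search(r'ATG(?:...){9,}?(?:TAA|TAG|TGA)', seq).group()

triplets = codons(match)
# Stop codons that the bare `...` swallowed before the final one.
stops_inside = [(i, c) for i, c in enumerate(triplets[:-1]) if c in STOPS]

print(triplets)
print(stops_inside)   # [(4, 'TGA'), (8, 'TAA')]
```

Since `...` matches any three characters, the in-frame `TGA` and `TAA` are consumed as ordinary triplets on the way to satisfying the minimum count.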

Inbar Rose · Answer 3 · 2013-04-28T08:36:09.313

0

You don't need regular expressions.

def chunks(l, n):
    """ Yield successive n-sized chunks from l.
    from: http://stackoverflow.com/a/312464/1561176
    """
    for i in range(0, len(l), n):  # range replaces Python 2's xrange
        yield l[i:i+n]

def method(sequence, start=('ATG',), stop=('TAA', 'TAG', 'TGA'), min_len=30):
    response = ''
    started = False
    for x in chunks(sequence, 3):
        if x in start:
            # a start codon opens a candidate (or is appended to an open one)
            started = True
            response += x
        elif x in stop and started:
            if len(response) >= min_len:
                # long enough: emit the candidate together with its stop codon
                yield response + x
                response = ''
                started = False
            else:
                # too short: keep reading past this stop codon
                response += x
        elif started:
            response += x
    if response:
        # leftover candidate that never reached a qualifying stop codon
        yield response

for result in method('ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'):
    print(result)

If I use the min_len of 30, the return is:

ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA

If I use a min_len of 0, the return is:

ATGGAAGTTGGATGA
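One thing the generator above does not do is discard a candidate that hits a stop codon too early; it reads on past it instead. A minimal sketch of the other choice follows, under the assumption (not confirmed by the OP) that a stop codon should always end the candidate and that short candidates are simply dropped. The name `find_orfs_in_frame` is hypothetical, only the reading frame starting at index 0 is scanned, and `min_len` counts the whole candidate including the start and stop codons.

```python
def find_orfs_in_frame(sequence, min_len=30, start="ATG",
                       stops=("TAA", "TAG", "TGA")):
    """Hypothetical scanner: collect start..stop runs of at least min_len."""
    orfs = []
    current = None  # None means no candidate is open
    for i in range(0, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        if current is None:
            if codon == start:
                current = codon
        else:
            current += codon
            if codon in stops:
                # A stop codon always closes the candidate; keep it only
                # if it is long enough.
                if len(current) >= min_len:
                    orfs.append(current)
                current = None
    return orfs

seq = 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'
print(find_orfs_in_frame(seq))             # []
print(find_orfs_in_frame(seq, min_len=0))  # ['ATGGAAGTTGGATGA']
```

With this choice the question's sequence gives no result at `min_len=30`, which agrees with the reading that a stop codon terminates the sequence no matter how short it is.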

edited Apr 28 '13 at 08:36

answered Apr 28 '13 at 08:25


The result you are getting is wrong. The OP clearly stated that "I also want them to be at least 30 characters long. . . hence the `{9,}?`", and I don't think the result you are getting is 30 characters long. – Bakuriu Apr 28 '13 at 08:31
The OP says the result should be the result I am getting. "So in this case, it should be finding 'ATGGAAGTTGGATGA' if it starts with the first 'ATG' and goes until the next 'TAA','TAG' or 'TGA'" . However, if the requirement is minimum 30, there is no problem to add that to this answer. But I would want to know what happens if a valid match is found that is not the length requirement. – Inbar Rose Apr 28 '13 at 08:31
That's why I commented also his question. It's not clear what he wants. Also, if the length is not a requirement simply replacing the `{9,}?` with `*?` would suffice. – Bakuriu Apr 28 '13 at 08:34

score 0 · Answer 4 · answered Apr 28 '13 at 08:40

Try this:

seq= 'ATGGAAGTTGGATGAAAGTGGAGGTAAAGAGAAGACGTTTGA'
OR_0 = re.findall(r'ATG(?:.{3})*?(?:TAA|TAG|TGA)',seq)

perreal


The requirement of at least 30 characters is not met. – nhahtdh Apr 28 '13 at 09:48
