Non-solution
Just remove the look-ahead. The match then consumes the text, so the matched text cannot be matched again (matching it again is what produces the extra, unwanted results).
'(ATG(?:[ATGC]{3}){%d,}?(?:TAG|TAA|TGA))' % (aa)
I assume your requirement is to find all sequences, except those that end at the same index as an existing sequence but are shorter than it.
Solution 1: Building on top of current solution
Note that your current regex still allows invalid sequences to be matched when ATG is too close to the end codon. You still need a negative look-ahead to prevent invalid sequences. Once it is in place, the lazy quantifier is no longer needed.
'(?=(ATG(?:(?!TAG|TAA|TGA)[ATGC]{3}){%d,}(?:TAG|TAA|TGA)))' % (aa)
You can then post-process all the matches and filter out the unwanted ones. Record every match together with its start and end indices. Sort the matches by end index, and for each end index, keep only the match with the smallest start index.
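The post-processing step described above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name longest_orfs and the (start, end, sequence) result tuples are hypothetical choices made for illustration, and aa is assumed to be the minimum number of codons between the start and end codons, as in the regex above.

```python
import re

def longest_orfs(seq, aa):
    """Hypothetical helper: keep, for each end index, the longest match."""
    # Same look-ahead regex as above; group 1 holds the candidate sequence.
    pattern = re.compile(
        r'(?=(ATG(?:(?!TAG|TAA|TGA)[ATGC]{3}){%d,}(?:TAG|TAA|TGA)))' % aa
    )
    best = {}  # end index -> (start index, matched text)
    for m in pattern.finditer(seq):
        start, end = m.start(1), m.end(1)
        # A smaller start index at the same end index means a longer match.
        if end not in best or start < best[end][0]:
            best[end] = (start, m.group(1))
    # Report the surviving matches ordered by end index.
    return [(start, end, text) for end, (start, text) in sorted(best.items())]
```

For example, in ATGAAAATGCCCTAG with aa set to 1, both ATG positions produce a match ending at index 15, and only the one starting at index 0 is kept.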
Solution 2: Reverse the string and use regex
This can be done by first reversing the sequence and then iterating through the matches of the following regex:
'(?=((?:GAT|AAT|AGT)(?:(?!GAT|AAT|AGT)[ATGC]{3}){%d,}GTA))' % (aa)
The regex uses a negative look-ahead to ensure that there are no end codons inside the sequence, and the quantifier is made greedy to get the longest instance.
The effect is not replicable with the normal order of the sequence. Since you require the indices of the ending codons to be unique, I make use of the fact that there can be only one match per index to enforce that condition. There is no way to enforce a unique ending position with the level of support in the re module.
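The reversed-string approach could look roughly like this. It is only a sketch under stated assumptions: the function name longest_orfs_reversed and the (start, end, sequence) tuples are hypothetical, and the indices are converted back to the coordinates of the original (unreversed) sequence.

```python
import re

def longest_orfs_reversed(seq, aa):
    """Hypothetical helper: search the reversed sequence for the longest matches."""
    # Reversed end codons come first, the reversed start codon (GTA) comes last.
    pattern = re.compile(
        r'(?=((?:GAT|AAT|AGT)(?:(?!GAT|AAT|AGT)[ATGC]{3}){%d,}GTA))' % aa
    )
    rev = seq[::-1]
    n = len(seq)
    results = []
    for m in pattern.finditer(rev):
        rstart, rend = m.start(1), m.end(1)
        # Map positions in the reversed string back to the original string
        # and restore the original orientation of the matched text.
        results.append((n - rend, n - rstart, m.group(1)[::-1]))
    results.sort()
    return results
```

Because each position in the reversed string yields at most one match, each end codon position in the original sequence appears at most once in the result.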
You don't need to reverse the string if you use the regex module. You only need to set the REVERSE flag to enable reverse searching with the same regex as above (not tested).