I have been given a string

ATGCCAGGCTAGCTTATTTAA

and I have to find all substrings of it that start with ATG and end with one of TAA, TAG, or TGA.

Here is what I am doing:

import re

seq = "ATGCCAGGCTAGCTTATTTAA"
# Start codon, any run of bases, then one of the three stop codons
pattern = re.compile(r"(ATG[ACGT]*(TAG|TAA|TGA))")
for match in pattern.finditer(seq):
    coding = match.group(1)
    print(coding)

This code gives me the output:

ATGCCAGGCTAGCTTATTTAA

But the expected output is:

ATGCCAGGCTAGCTTATTTAA, ATGCCAGGCTAG

What should I change in my code?

PMende
abhishhh1
  • For your current example you could use 2 capturing groups https://regex101.com/r/58Unpp/1 – The fourth bird Nov 06 '19 at 18:24
  • Or to get all the variants I think this could do it `(?=(ATG[ATCG]*(?:T(?:A[AG]|GA))))(ATG[ATCG]*?(?:T(?:A[AG]|GA)))` https://regex101.com/r/RaCUMS/1 – The fourth bird Nov 06 '19 at 18:39
  • No, it fails for the string **ATGCCAGGTATGTTATTGTAG**. The output should be **ATGCCAGGTATGTTATTGTAG** and **ATGTTATTGTAG** – abhishhh1 Nov 06 '19 at 18:44

2 Answers

In r"(ATG[ACGT]*(TAG|TAA|TGA))", the * quantifier is "greedy": it consumes as much as it can. Use the non-greedy form *?, as in r"(ATG[ACGT]*?(TAG|TAA|TGA))", to tell the regexp to take the shortest matching string instead of the longest.
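To make the difference concrete, here is a minimal sketch that runs both forms on the sequence from the question. The names `greedy` and `lazy` are only illustrative, and the inner group is written as non-capturing for brevity. Note that each pattern still reports a single match per start position, so neither one returns both expected substrings by itself.

```python
import re

seq = "ATGCCAGGCTAGCTTATTTAA"

# Greedy: [ACGT]* grabs as much as possible, so the match runs to the last stop codon.
greedy = re.compile(r"ATG[ACGT]*(?:TAG|TAA|TGA)")
# Non-greedy: [ACGT]*? grabs as little as possible, so the match ends at the first stop codon.
lazy = re.compile(r"ATG[ACGT]*?(?:TAG|TAA|TGA)")

greedy_matches = [m.group(0) for m in greedy.finditer(seq)]
lazy_matches = [m.group(0) for m in lazy.finditer(seq)]

print(greedy_matches)  # ['ATGCCAGGCTAGCTTATTTAA']
print(lazy_matches)    # ['ATGCCAGGCTAG']
```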

Kirk Strauser

tl;dr: you can't do this with a regex alone


The problem isn't greedy vs. non-greedy matching.

The problem isn't overlapping matches either: there's a solution for that (How to find overlapping matches with a regexp?)

The real problem with the OP's question is that regex isn't designed to return several matches with the same start. At each starting position the engine reports the first match it finds and then moves on. That's one of the reasons why it's fast. However, it also prevents regex from returning multiple overlapping matches that begin at the same character.
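As a rough illustration of that limit, the sketch below uses the usual lookahead trick for overlapping matches. The function name `longest_match_per_start` is hypothetical. It handles the string from the comments, where the two expected substrings start at different positions, but it still yields only one result for the original string, where both expected substrings start at the same character.

```python
import re

def longest_match_per_start(seq):
    """Return the greedy match beginning at each position where one exists."""
    # The lookahead consumes nothing, so finditer tries every position in turn
    # and overlapping matches with different starts are all reported.
    pattern = re.compile(r"(?=(ATG[ACGT]*(?:TAG|TAA|TGA)))")
    return [m.group(1) for m in pattern.finditer(seq)]

# Different starts: both substrings are found.
print(longest_match_per_start("ATGCCAGGTATGTTATTGTAG"))
# ['ATGCCAGGTATGTTATTGTAG', 'ATGTTATTGTAG']

# Same start: only one of the two expected substrings is found.
print(longest_match_per_start("ATGCCAGGCTAGCTTATTTAA"))
# ['ATGCCAGGCTAGCTTATTTAA']
```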

See "Regex including overlapping matches with same start" for more info.

Regex isn't the be-all and end-all of pattern matching. It's in the name: regular expressions are all about single-interpretation symbol sequences, and DNA tends not to fit that paradigm.
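One possible way around this, sketched here without any regex, is to record every start-codon position and every stop-codon position and then pair them up. The function name `all_start_stop_substrings` and the constant `STOP_CODONS` are hypothetical. The sketch assumes the sequence contains only A, C, G and T, and it does not enforce a reading frame, since the question does not ask for one.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")

def all_start_stop_substrings(seq):
    """Return every substring that starts with ATG and ends with a stop codon."""
    # Index of each ATG, and the end index (exclusive) of each stop codon.
    starts = [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]
    stop_ends = [i + 3 for i in range(len(seq) - 2) if seq[i:i + 3] in STOP_CODONS]
    # A stop codon only counts if it begins at or after the end of the ATG,
    # so the slice must be at least 6 characters long.
    return [seq[s:e] for s in starts for e in stop_ends if e >= s + 6]

print(all_start_stop_substrings("ATGCCAGGCTAGCTTATTTAA"))
# ['ATGCCAGGCTAG', 'ATGCCAGGCTAGCTTATTTAA']
print(all_start_stop_substrings("ATGCCAGGTATGTTATTGTAG"))
# ['ATGCCAGGTATGTTATTGTAG', 'ATGTTATTGTAG']
```

Pairing every start with every stop is quadratic in the worst case, which should be fine for short sequences like these but may need rethinking for long ones.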

Cedar