I'll explain what I want using an example. I'm working with DNA sequences. Say I want to remove everything between GUA and CAG (including GUA and CAG themselves) from a string. So if the input is "AAAAGUAGGGGCAGCAGUUUUUGUAAAAACAG"

The output should be ["AAAA","CAGUUUUU"]. I initially used re.split(r'GUA\w*CAG',a), but that returns ["AAAA"]. It seems to match up to the last occurrence of CAG in the string instead of the first.

  • What should happen with `AAAGUAGGGGUAUUUCAG`? Should the first or the second `GUA` count? Also, shouldn't you make sure that the number of bases between the two markers is divisible by 3? – Tim Pietzcker Aug 24 '14 at 13:12

2 Answers

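In regex, the quantifiers *, + and ? are greedy by default: they match as much text as possible, which is why \w* runs all the way to the last CAG.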

If you don't want that behavior, use their non-greedy counterparts *?, +? and ??:

re.split(r'GUA\w*?CAG',a)

See https://docs.python.org/2/library/re.html#regular-expression-syntax
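To make the difference concrete, here is a minimal sketch that runs both patterns on the string from the question. The variable names (`seq`, `greedy`, `lazy`) are just illustrative.

```python
import re

seq = "AAAAGUAGGGGCAGCAGUUUUUGUAAAAACAG"

# Greedy: \w* grabs as much as it can, so the match runs from the
# first GUA to the LAST CAG in the string.
greedy = re.split(r'GUA\w*CAG', seq)

# Non-greedy: \w*? grabs as little as it can, so each match stops
# at the first CAG that follows a GUA.
lazy = re.split(r'GUA\w*?CAG', seq)

print(greedy)  # ['AAAA', '']
print(lazy)    # ['AAAA', 'CAGUUUUU', '']
```

Note that re.split keeps an empty string when a match sits at the very end of the input, so you may want to filter out empty pieces afterwards.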

Sylvain Leroux
You need to make the quantifier non-greedy by adding ? after it. It's also better to use .*? instead of \w*?, because \w matches only word characters.

>>> import re
>>> s = "AAAAGUAGGGGCAGCAGUUUUUGUAAAAACAG"
>>> m = re.split(r'GUA.*?CAG', s)
>>> m
['AAAA', 'CAGUUUUU', '']
>>> [x for x in m if x]
['AAAA', 'CAGUUUUU']
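As a rough illustration of when the choice between \w and . matters, the sketch below assumes a sequence that contains a non-word character between the markers. The `-` used here is a hypothetical alignment gap, not something from the question.

```python
import re

# Hypothetical input: '-' stands in for a gap inside the region to remove.
gapped = "AAAAGUAGG-GGCAGUUUU"

# \w does not match '-', so this pattern never reaches CAG and nothing is split.
with_word = re.split(r'GUA\w*?CAG', gapped)

# . matches any character except a newline, so the region is removed.
with_dot = re.split(r'GUA.*?CAG', gapped)

print(with_word)  # ['AAAAGUAGG-GGCAGUUUU']
print(with_dot)   # ['AAAA', 'UUUU']
```

If the input can span several lines, . will stop at a newline unless the pattern is compiled with re.DOTALL. For strings made up only of base letters, both patterns behave the same.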
Avinash Raj