How to remove characters between the first occurrences of an expression?

Question

I'll explain what I want using an example. I'm working with DNA sequences. Let's say I want to remove everything between GUA and CAG (including GUA and CAG) in a string. So if the input is: "AAAAGUAGGGGCAGCAGUUUUUGUAAAAACAG"

The output should be: ["AAAA","CAGUUUUU"]. I initially used re.split(r'GUA\w*CAG',a), but that returns ['AAAA', '']. It seems to match up to the last occurrence of CAG in the string instead of the first occurrence.

What should happen with `AAAGUAGGGGUAUUUCAG`? Should the first or the second `GUA` count? Also, shouldn't you make sure that the number of bases between the two markers is divisible by 3? — Tim Pietzcker, Aug 24 '14 at 13:12

score 2 · Accepted Answer · edited May 23 '17 at 12:14


In regular expressions, the quantifiers *, + and ? are greedy by default: they match as many characters as possible.

If you don't want that behavior, use their non-greedy counterparts *?, +? and ??:

re.split(r'GUA\w*?CAG',a)

See https://docs.python.org/2/library/re.html#regular-expression-syntax
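To make the difference concrete, here is a minimal sketch that runs both patterns on the string from the question. The variable names greedy and lazy are only illustrative.

```python
import re

a = "AAAAGUAGGGGCAGCAGUUUUUGUAAAAACAG"

# Greedy: \w* first consumes the rest of the string, then backtracks
# just far enough to find a "CAG", which is the LAST one.
greedy = re.split(r'GUA\w*CAG', a)

# Non-greedy: \w*? grows one character at a time and stops at the
# FIRST "CAG" that follows each "GUA".
lazy = re.split(r'GUA\w*?CAG', a)

print(greedy)  # ['AAAA', '']
print(lazy)    # ['AAAA', 'CAGUUUUU', '']
```

Both results end with an empty string because the final match reaches the end of the input.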

edited May 23 '17 at 12:14 by Community · answered Aug 24 '14 at 13:08 by Sylvain Leroux

You still need to filter out the empty string that re.split leaves in the list. – Avinash Raj Aug 24 '14 at 13:16

score 0 · Answer 2 · answered Aug 24 '14 at 13:12

You need to make the quantifier non-greedy by adding ? after it. It is also better to use .*? instead of \w*?, because \w matches only word characters.

>>> import re
>>> s = "AAAAGUAGGGGCAGCAGUUUUUGUAAAAACAG"
>>> m = re.split(r'GUA.*?CAG', s)
>>> m
['AAAA', 'CAGUUUUU', '']
>>> [x for x in m if x]
['AAAA', 'CAGUUUUU']
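The split and the filtering step can be wrapped into one function. This is only a sketch: split_outside_markers and its start/end parameters are hypothetical names, not something from the re module. It also assumes the sequence is a single line, since . does not match newlines by default.

```python
import re

def split_outside_markers(seq, start="GUA", end="CAG"):
    """Delete every start...end span (markers included), pairing each
    start marker with the nearest following end marker, and return the
    non-empty pieces that remain."""
    # re.escape keeps the markers literal even if they contain regex metacharacters.
    pattern = re.escape(start) + r".*?" + re.escape(end)
    return [piece for piece in re.split(pattern, seq) if piece]

result = split_outside_markers("AAAAGUAGGGGCAGCAGUUUUUGUAAAAACAG")
print(result)  # ['AAAA', 'CAGUUUUU']

# The input from the comment on the question: the leftmost GUA starts the match.
nested = split_outside_markers("AAAGUAGGGGUAUUUCAG")
print(nested)  # ['AAA']
```

Note that this always starts a match at the leftmost GUA. If the second GUA should count instead, or if the removed span must be a multiple of three bases, the pattern would need to change.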
