7

I'm having some trouble with the re.finditer() function in Python. For example:

>>> import re
>>> sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
>>> [[m.start(), m.end()] for m in re.finditer(r'(?=gatttaacg)', sequence)]

out: [[22,22]]

As you can see, start() and end() return the same value. I've noticed this before and worked around it by using m.start() + len(query_sequence) instead of m.end(), but I don't understand why it happens.

lstbl

4 Answers

6

The third-party regex module supports overlapping matches with finditer:

import regex  # third-party module, not the standard-library re
sequence = 'acaca'
print([[m.start(), m.end()] for m in regex.finditer(r'(aca)', sequence, overlapped=True)])
[[0, 3], [2, 5]]
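
For comparison, here is a minimal standard-library sketch of the same 'acaca' example. It is not what the regex module does internally; it only contrasts a plain pattern, which consumes each match and so skips the overlapping one, with a capturing group wrapped in a lookahead. The variable names are illustrative.

```python
import re

sequence = 'acaca'

# A plain pattern consumes 'aca' at 0..3, so the search resumes at index 3
# and the overlapping occurrence starting at index 2 is never reported.
plain = [m.span() for m in re.finditer(r'aca', sequence)]

# A lookahead consumes nothing, so every start position is tried.
# The inner group records what was seen; span(1) gives its extent.
overlapping = [m.span(1) for m in re.finditer(r'(?=(aca))', sequence)]

print(plain)        # [(0, 3)]
print(overlapping)  # [(0, 3), (2, 5)]
```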
Padraic Cunningham
2
import re
sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
print([[m.start(), m.end()] for m in re.finditer(r'(gatttaacg)', sequence)])

Remove the lookahead. A lookahead does not capture or consume any text; it only asserts that the pattern follows.

Output: [[22, 31]]

If you have to use a lookahead, use:

sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
print([[m.start(), m.start() + len("aca")] for m in re.finditer(r'(?=aca)', sequence)])
vks
  • A clarification: this means that when `re.finditer` matches `gatttaacg`, the actual values aren't consumed. It only goes to that position and says "yes, what is ahead of this position is the string requested". – Alyssa Haroldsen Jan 13 '16 at 18:21
  • So the end() function is essentially worthless if you are using a lookahead? – lstbl Jan 13 '16 at 18:24
  • @lstbl Lookaheads don't consume any of the string. They only check what follows a position. So you get the start, but the regex engine stays there because nothing is consumed. – vks Jan 13 '16 at 18:25
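
The point made in these comments can be checked directly. This short sketch uses the sequence from the question and compares the lookahead match with a plain match of the same pattern; the variable names are illustrative.

```python
import re

sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'

# With a lookahead, the overall match (group 0) is the empty string,
# so start() and end() coincide.
m = next(re.finditer(r'(?=gatttaacg)', sequence))
zero_width = (m.group(0), m.span())

# Without the lookahead, the nine characters are consumed.
m2 = re.search(r'gatttaacg', sequence)
consumed = (m2.group(0), m2.span())

print(zero_width)  # ('', (22, 22))
print(consumed)    # ('gatttaacg', (22, 31))
```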
1

As specified, you are required to find overlapping matches and need the lookahead. However, you appear to know the exact string you're looking for. How about this?

import re

def find_overlapping(sequence, matchstr):
    # re.escape keeps any regex metacharacters in matchstr literal
    for m in re.finditer('(?={})'.format(re.escape(matchstr)), sequence):
        yield (m.start(), m.start() + len(matchstr))

Alternatively, you could use the third-party Python regex module, as described here.
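
Since the search string is a known literal, a regex is not strictly needed. As a rough sketch of a different technique (plain `str.find` in a loop, not the lookahead approach above), something like the following also yields overlapping hits. The function name `find_overlapping_literal` is made up for this illustration.

```python
def find_overlapping_literal(sequence, matchstr):
    """Yield (start, end) for every occurrence of matchstr, overlaps included."""
    start = sequence.find(matchstr)
    while start != -1:
        yield (start, start + len(matchstr))
        # Resume one character later, not after the hit, so overlaps are found.
        start = sequence.find(matchstr, start + 1)

hits = list(find_overlapping_literal('acaca', 'aca'))
print(hits)  # [(0, 3), (2, 5)]
```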

Alyssa Haroldsen
  • This works. It was what I was using originally. I'm still a bit confused about why the regex engine can't determine how long the match is (for example, if the length of the regex match was not known a priori, as it is in my case), since it is still able to determine whether there is a match. – lstbl Jan 13 '16 at 18:29
  • @lstbl, it can. `m.start()` and `m.end()` refer to the span of group 0, which is empty. So you are just misinterpreting the API. – Kijewski Jan 13 '16 at 18:32
1

If the length of the subsequence is not known a priori, you can put a capturing group inside the lookahead and take its span:

[m.span(1) for m in re.finditer(r'(?=(gatttaacg))',sequence)] == [(22,31)]

E.g. to find all repeated characters:

[m.span(1) for m in re.finditer(r'(?=(([acgt])\2+))',sequence)]
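
To see what the repeated-character pattern returns, here is a small sketch on a made-up test string (`'ttaaacg'` is an assumption for illustration, not data from the question). Because the lookahead is tried at every position, a run of three characters is reported twice: once in full and once for its two-character tail. Dropping the lookahead reports each run once.

```python
import re

sequence = 'ttaaacg'  # hypothetical short test string

# Lookahead version: tried at every index, so sub-runs are reported too.
with_lookahead = [m.span(1) for m in re.finditer(r'(?=(([acgt])\2+))', sequence)]

# Consuming version: each run is matched once and then skipped over.
consuming = [m.span(1) for m in re.finditer(r'(([acgt])\2+)', sequence)]

print(with_lookahead)  # [(0, 2), (2, 5), (3, 5)]
print(consuming)       # [(0, 2), (2, 5)]
```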
Kijewski