re.finditer() returning same value for start and end methods

Question

I'm having some trouble with the re.finditer() method in python. For example:

>>> import re
>>> sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
>>> [[m.start(), m.end()] for m in re.finditer(r'(?=gatttaacg)', sequence)]
[[22, 22]]

As you can see, start() and end() return the same value. I've noticed this before and worked around it by using m.start() + len(query_sequence) instead of m.end(), but I'm confused about why this happens.

I'm using the lookahead because I want overlapping matches. For example if I was searching for aca, I'd want acaca to count as 2 occurrences instead of 1 — lstbl, Jan 13 '16 at 18:21
Well, you know the length of the requested sequence, so why do you even need `m.end()`? — Kijewski, Jan 13 '16 at 18:23
Ok then removing the lookahead is not going to work, just adding the length to start is your only option — Padraic Cunningham, Jan 13 '16 at 18:23
The lookahead isn't part of the match. The match starts and ends at position 22, even though the fact that it's a match depends on characters after that. — user2357112, Jan 13 '16 at 18:30

score 6 · Accepted Answer · answered Jan 13 '16 at 18:28

The regex module supports overlapping matches with finditer:

import regex  # third-party module, not the standard-library re
sequence = 'acaca'
print([[m.start(), m.end()] for m in regex.finditer(r'(aca)', sequence, overlapped=True)])
[[0, 3], [2, 5]]

Padraic Cunningham

Beautiful answer. This makes sense. – lstbl Jan 13 '16 at 18:30
@lstbl: Note that the `regex` module in this answer is an entirely different module from the standard-library `re` module you're using. – user2357112 Jan 13 '16 at 18:33
I did not notice that – lstbl Jan 13 '16 at 18:37
@lstbl you need to install `regex` module through `pip` or something and you are good to go – vks Jan 13 '16 at 19:03

vks · Answer 2 · 2016-01-13T18:24:30.440

import re
sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
print([[m.start(), m.end()] for m in re.finditer(r'(gatttaacg)', sequence)])

Remove the lookahead. It does not consume any characters; it only asserts that the pattern follows.

Output: [[22, 31]]

If you have to use the lookahead, add the length of the search string to m.start():

sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
print([[m.start(), m.start() + len("aca")] for m in re.finditer(r'(?=aca)', sequence)])
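
To make the difference concrete, here is a small sketch on the 'acaca' example from the question's comments. The variable names are only illustrative. It compares a plain pattern, which consumes characters and so cannot overlap, with the zero-width lookahead, where start() and end() coincide and the known length has to be added by hand:

```python
import re

text = 'acaca'

# Plain pattern: matched characters are consumed, so matches cannot overlap.
consumed = [(m.start(), m.end()) for m in re.finditer(r'aca', text)]

# Lookahead: the match itself is empty, so start() == end() at every hit.
zero_width = [(m.start(), m.end()) for m in re.finditer(r'(?=aca)', text)]

# Recover the real spans by adding the known length of the search string.
overlapping = [(start, start + len('aca')) for start, _ in zero_width]

print(consumed)     # [(0, 3)]
print(zero_width)   # [(0, 0), (2, 2)]
print(overlapping)  # [(0, 3), (2, 5)]
```

The second overlapping hit only shows up in the lookahead version because the engine has not moved past position 0 after the first match.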

edited Jan 13 '16 at 18:24

answered Jan 13 '16 at 18:18

vks

A clarification: this means that when `re.finditer` matches `gatttaacg`, the actual values aren't consumed. It only goes to that position and says "yes, what is ahead of this position is the string requested". – Alyssa Haroldsen Jan 13 '16 at 18:21
so the end() function is essentially worthless if you are using a lookahead? – lstbl Jan 13 '16 at 18:24
@lstbl Lookaheads don't consume any characters. They only check the text after a position. So you get the start, but the regex engine stays there because nothing is consumed. – vks Jan 13 '16 at 18:25

score 1 · Answer 3 · edited May 23 '17 at 12:33

As specified, you are required to find overlapping matches and need the lookahead. However, you appear to know the exact string you're looking for. How about this?

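import re

def find_overlapping(sequence, matchstr):
    # re.escape keeps matchstr literal, so len(matchstr) is the match length
    for m in re.finditer('(?={})'.format(re.escape(matchstr)), sequence):
        yield (m.start(), m.start() + len(matchstr))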

Alternatively, you could use the third-party Python regex module, as described here.
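
A quick usage sketch of the generator, run against the sequence from the question. It assumes the sequence is passed through to re.finditer and that the search string is plain text; escaping it with re.escape is my own addition and only makes sense under that assumption:

```python
import re

def find_overlapping(sequence, matchstr):
    # Assumption: matchstr is literal text, so it is escaped before use.
    pattern = '(?={})'.format(re.escape(matchstr))
    for m in re.finditer(pattern, sequence):
        # The lookahead match is empty, so the end is start + known length.
        yield (m.start(), m.start() + len(matchstr))

sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'

print(list(find_overlapping(sequence, 'aca')))        # [(55, 58), (57, 60)]
print(list(find_overlapping(sequence, 'gatttaacg')))  # [(22, 31)]
```

The two 'aca' hits overlap at position 57, which a pattern without the lookahead would report as a single match.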

answered Jan 13 '16 at 18:25

Alyssa Haroldsen

This works. It was what I was using originally. I'm still a bit confused about why the regex engine can't determine how long the match is (for example, if the length of the match were not known a priori, as it is in my case), since it is still able to determine that there is a match. – lstbl Jan 13 '16 at 18:29
@lstbl, it can. `m.start()` and `m.end()` refer to the spans of group 0, which is empty. So you are just misinterpreting the API. – Kijewski Jan 13 '16 at 18:32

Kijewski · Answer 4 · 2016-01-13T18:48:50.203

If the length of the subsequence is not known a priori, you can put a capturing group inside the lookahead and take its span:

[m.span(1) for m in re.finditer(r'(?=(gatttaacg))',sequence)] == [(22,31)]

E.g. to find all repeated characters:

[m.span(1) for m in re.finditer(r'(?=(([acgt])\2+))',sequence)]
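
For illustration, here is what the repeated-character pattern returns on a short made-up string (the input 'aacccgt' is my own example, not from the question). Group 0 is the empty lookahead match, while group 1 holds the text the lookahead actually saw:

```python
import re

text = 'aacccgt'

# span(1) and group(1) describe the capturing group inside the lookahead,
# so the match length does not need to be known in advance.
runs = [(m.span(1), m.group(1)) for m in re.finditer(r'(?=(([acgt])\2+))', text)]

print(runs)
# [((0, 2), 'aa'), ((2, 5), 'ccc'), ((3, 5), 'cc')]
```

Note that the run 'ccc' is also reported from its second character as 'cc', because the lookahead is tried at every position; filter those out if only maximal runs are wanted.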

edited Jan 13 '16 at 18:48

answered Jan 13 '16 at 18:30

Kijewski
