Why is my positive lookahead assertion consuming the string and not matching correctly?

Question

I'm trying to find all the occurrences of a substring inside a string and print their start and end indices using a regular expression.

For example, with string = 'bbbcbb' and sub = 'bb', I must get (0,1) (1,2) (4,5) as my output.

My code:

import re
matches = list(re.finditer(r'bb(?=[a-zA-Z]|$)', 'bbbcbb'))

The output:

[<_sre.SRE_Match object; span=(0, 2), match='bb'>,
 <_sre.SRE_Match object; span=(4, 6), match='bb'>]

I went through the documentation at https://docs.python.org/3/library/re.html, and to my understanding the lookahead assertion should work like this:

At position 0, it will match 'bb' followed by "b", i.e. 'bb'bcbb
At position 1, it will match 'bb' followed by "c", i.e. b'bb'cbb
Then it will not match until position 4, where it will match 'bb' followed by $, i.e. bbbc'bb'

Why is the lookahead assertion ignoring the b'bb'cbb at the (1,3) position? Or is my understanding of the lookahead assertion flawed?

The algorithm you've described can be done with the regex `(?=bb)`. If you want it to capture `bb`, use `(?=(bb))`. [Link](https://regex101.com/r/7dahEz/1) — Olvin Roght, Jul 18 '19 at 07:13

score 1 · Answer 1 · answered Jul 18 '19 at 07:27

This has nothing to do with your lookahead, and is caused by re not returning overlapping matches. Here's a simpler example:

import re

regex = re.compile("aa")
results = list(regex.finditer("aaaa"))
# You might expect (0, 2), (1, 3), (2, 4), but matches never overlap
print(results)
# Output:
# [<_sre.SRE_Match object; span=(0, 2), match='aa'>,
#  <_sre.SRE_Match object; span=(2, 4), match='aa'>]

The correct way to do this is by using groups and a lookahead, as explained here: Python regex find all overlapping matches?
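A minimal sketch of that approach, assuming the goal is the inclusive (start, end) index pairs from the question. The helper name `overlapping_spans` is made up for illustration and is not part of `re`:

```python
import re

def overlapping_spans(sub, text):
    """Return inclusive (start, end) index pairs for every occurrence
    of sub in text, including overlapping ones. (Hypothetical helper.)"""
    # The whole pattern is a lookahead, so each match is zero-width and
    # consumes nothing; the capturing group inside it records where the
    # substring actually sits.
    pattern = re.compile('(?=(' + re.escape(sub) + '))')
    # m.end(1) is exclusive, so subtract 1 for an inclusive end index.
    return [(m.start(1), m.end(1) - 1) for m in pattern.finditer(text)]

print(overlapping_spans('bb', 'bbbcbb'))  # [(0, 1), (1, 2), (4, 5)]
```

Because nothing is consumed, the scanner moves forward one character at a time, which is what lets the occurrence starting at index 1 be found.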

amdex

This will not return the index of the matched substrings. I need to find the indices of the matched substrings. – Arijit Dutta Jul 18 '19 at 10:54
You can use `groups` to find the groups that were captured and add those to the start indices of the match. – amdex Jul 18 '19 at 11:04
I tried this out. But the groups function returns the substring itself. What I did was add the length of the string - 1 to the start index of the match. – Arijit Dutta Jul 19 '19 at 05:46
You can add another group to find the actual string, `regex = re.compile("(?=(aa))")`, and then add `len(x.group(1))` to the start index. This makes your code a bit less reliant on the thing you are trying to find. – amdex Jul 19 '19 at 07:40

score 1 · Answer 2 · answered Jul 18 '19 at 08:00

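The pattern 'bb(?=[a-zA-Z]|$)' matches (and consumes) 2 characters instead of 1, and then asserts that what is on the right is a letter a-z or A-Z, or the end of the string. Because both b's are consumed, the next search starts at index 2, so the occurrence starting at index 1 is never tried.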

Using re.finditer, you might update your pattern to match a single b and put a single b in the positive lookahead:

import re
matches = list(re.finditer(r'b(?=b)', 'bbbcbb'))
for m in matches:
    print(m.span())

Result

(0, 1)
(1, 2)
(4, 5)
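Note that these spans line up with the requested output only because 'bb' is two characters long. A rough sketch of how the same idea might be generalised to longer substrings follows: it consumes the first character, looks ahead for the rest, and computes the inclusive end from the substring length. The helper name `spans_via_first_char` is hypothetical, and the sketch assumes `sub` is non-empty:

```python
import re

def spans_via_first_char(sub, text):
    """Inclusive (start, end) pairs for all occurrences of sub in text.
    (Hypothetical helper; assumes sub is a non-empty literal string.)"""
    head = re.escape(sub[0])   # the single character that is consumed
    tail = re.escape(sub[1:])  # the remainder, only asserted via lookahead
    pattern = re.compile(head + '(?=' + tail + ')')
    # Only one character is consumed per match, so the end index is
    # derived from the substring length rather than from m.end().
    return [(m.start(), m.start() + len(sub) - 1)
            for m in pattern.finditer(text)]

print(spans_via_first_char('bb', 'bbbcbb'))  # [(0, 1), (1, 2), (4, 5)]
print(spans_via_first_char('aba', 'ababa'))  # [(0, 2), (2, 4)]
```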
