How to find overlapping matches with a regexp?

Question

>>> import re
>>> match = re.findall(r'\w\w', 'hello')
>>> print(match)
['he', 'll']

Since \w\w means two characters, 'he' and 'll' are expected. But why do 'el' and 'lo' not match the regex?

>>> match1 = re.findall(r'el', 'hello')
>>> print(match1)
['el']

[Lookahead](http://stackoverflow.com/questions/320448/overlapping-matches-in-regex) — Pavan Manjunath, Jul 11 '12 at 10:45

score 142 · Accepted Answer · edited Nov 24 '17 at 23:10

findall doesn't return overlapping matches by default. This expression does, however:

>>> re.findall(r'(?=(\w\w))', 'hello')
['he', 'el', 'll', 'lo']

Here (?=...) is a lookahead assertion:

(?=...) matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
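
As a small illustration of the quoted behaviour, the sketch below runs the documentation's Isaac (?=Asimov) pattern against two sample strings. The sample strings and variable names are made up for this example.

```python
import re

pattern = re.compile(r'Isaac (?=Asimov)')

# The lookahead succeeds, but only 'Isaac ' is part of the match:
# 'Asimov' is checked, not consumed.
followed = pattern.search('Isaac Asimov')
print(repr(followed.group(0)), followed.span())

# The lookahead fails here, so there is no match at all.
not_followed = pattern.search('Isaac Newton')
print(not_followed)
```

Because the text inside the lookahead is never consumed, the next search can start inside it, which is what makes overlapping results possible.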

edited Nov 24 '17 at 23:10 by notpeter · answered Jul 11 '12 at 10:44 by Otto Allmendinger

But I don't understand why it advances to the next letter if it's inside the positive lookahead assertion. Could you explain, please? – MrZH6 Apr 01 '20 at 21:54

@MrZH6 I guess it's due to the capturing group (the parentheses around \w\w). The actual match is still an empty string, whereas group 1 is filled with the text matched by \w\w (as you can test at https://regex101.com/). So the lookahead captures the two characters in a group, but the match itself is zero-length and consumes nothing, so the next search starts only one character further along. And Python's re.findall returns the captured groups: https://docs.python.org/3/library/re.html#re.findall – Sviatozar Petrenko Jan 27 '22 at 12:48
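
One way to check this explanation is to inspect the match objects directly with re.finditer. This is only a sketch, and the variable names are illustrative.

```python
import re

rows = []
for m in re.finditer(r'(?=(\w\w))', 'hello'):
    # group(0) is the overall match (an empty string here);
    # group(1) is the pair of characters captured inside the lookahead.
    rows.append((m.start(), m.end(), m.group(0), m.group(1)))

for row in rows:
    print(row)
```

Each overall match has the same start and end position, so the scan resumes one position further along, while group 1 holds the two characters the lookahead saw.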

score 52 · Answer 2 · answered Sep 23 '13 at 18:54

You can use the third-party regex module (available on PyPI), which supports overlapping matches.

>>> import regex as re
>>> match = re.findall(r'\w\w', 'hello', overlapped=True)
>>> print(match)
['he', 'el', 'll', 'lo']
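
If installing a third-party module is not an option, a roughly similar result can be sketched with the standard re module by restarting the search one character after the start of each match. This is a different technique from the overlapped=True flag, not how the regex module is implemented, and the helper name findall_overlapped is hypothetical.

```python
import re

def findall_overlapped(pattern, text):
    """Collect matches, restarting one character after each match start."""
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(text):
        m = compiled.search(text, pos)
        if m is None:
            break
        results.append(m.group(0))
        # Step past the start of this match only, not past its end,
        # so the next match may overlap it.
        pos = m.start() + 1
    return results

print(findall_overlapped(r'\w\w', 'hello'))
```

Unlike the lookahead approach, this keeps the pattern unchanged and returns the full match text rather than a captured group.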

answered Sep 23 '13 at 18:54 by David C

nhahtdh · Answer 3 · 2014-12-18T03:24:05.900

Except for zero-length assertions, every character in the input is consumed as it is matched. If you ever need to capture the same character in the input string more than once, you need a zero-length assertion in the regex.

There are several zero-length assertions (e.g. ^ (start of input/line), $ (end of input/line), \b (word boundary)), but look-arounds ((?<=) positive look-behind and (?=) positive look-ahead) are the only way to capture overlapping text from the input. Negative look-arounds ((?<!) negative look-behind, (?!) negative look-ahead) are not very useful here: if the assertion succeeds, the pattern inside it (and so any capture inside it) failed to match; if the pattern inside matches, the assertion fails and so does the overall match. These assertions are zero-length, as mentioned above, which means that they test the input without consuming any characters. When an assertion passes, it matches the empty string.
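
To make the contrast between positive and negative look-ahead concrete, here is a minimal sketch using Python's re module on the string from the question. The variable names are illustrative.

```python
import re

text = 'hello'

# Positive look-ahead: the assertion passes where \w\w matches,
# and group 1 holds the captured pair.
positive = [(m.start(), m.group(1)) for m in re.finditer(r'(?=(\w\w))', text)]

# Negative look-ahead: the assertion passes only where \w\w does NOT match,
# so group 1 never takes part in a successful match.
negative = [(m.start(), m.group(1)) for m in re.finditer(r'(?!(\w\w))', text)]

print(positive)
print(negative)
```

The negative version only succeeds at the last two positions, where fewer than two word characters remain, and its group is always None.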

Applying the knowledge above, a regex that works for your case would be:

(?=(\w\w))
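
The same idea extends to windows of other lengths. The sketch below wraps it in a small helper. The name overlapping_ngrams and its parameters are hypothetical, and Python's re module is assumed.

```python
import re

def overlapping_ngrams(text, n):
    """Return every run of n consecutive word characters, overlaps included."""
    # The capture group sits inside the look-ahead, so each match is
    # zero-length and the scan moves on by a single character.
    pattern = re.compile(r'(?=(\w{%d}))' % n)
    return pattern.findall(text)

print(overlapping_ngrams('hello', 2))
print(overlapping_ngrams('hello', 3))
```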

score 0 · Answer 4 · edited Jun 02 '21 at 13:00

I'm no regex expert, but I'd like to answer a similar question of my own.

If you want to use a capture group with the lookahead:

example regex: (\d)(?=.\1)

string: 5252

This will match the first 5 as well as the first 2.

The (\d) creates capture group 1, and (?=.\1) asserts that what follows is any single character and then a repeat of group 1, without consuming any of the string, which allows the checked regions to overlap.
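
A quick way to check this is to run the pattern through Python's re module. This is a sketch, and the variable names are illustrative.

```python
import re

pattern = re.compile(r'(\d)(?=.\1)')

# Record where each match starts and which digit it matched.
hits = [(m.start(), m.group(0)) for m in pattern.finditer('5252')]
print(hits)
```

Only the digit itself is consumed. The two characters checked by the lookahead stay available, so the 2 at position 1 can still match even though it was part of the lookahead for the 5 at position 0.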

edited Jun 02 '21 at 13:00 by logi-kal · answered Feb 04 '19 at 15:41 by Obay Abd-Algader
