How to use regex to find all overlapping matches

Question

I'm trying to find every 10-digit series of numbers within a larger series of numbers using re in Python 2.6.

I can easily grab the non-overlapping matches, but I want every match in the number series. E.g.,

in "123456789123456789"

I should get the following list:

[1234567891,2345678912,3456789123,4567891234,5678912345,6789123456,7891234567,8912345678,9123456789]

I've found references to a "lookahead", but the examples I've seen only show pairs of numbers rather than larger groupings and I haven't been able to convert them beyond the two digits.
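
For reference, here is a minimal sketch of the non-overlapping behaviour described above, using only the standard `re` module. The variable names are illustrative and not taken from the question.

```python
import re

s = "123456789123456789"

# A plain pattern consumes the characters it matches, so scanning resumes
# after the 10th digit; the remaining 8 digits are too short to match again.
non_overlapping = re.findall(r'\d{10}', s)
print(non_overlapping)  # ['1234567891']
```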

The presented solutions won't work when the overlapping matches start at the same point, e.g., matching "a|ab|abc" against "abcd" will only return one result. Is there a solution for that that does not involve calling match() multiple times, manually keeping track of the 'end' boundary? — Vítor De Araújo, Oct 28 '11 at 19:10
@VítorDeAraújo: overlapping regexes like `(a|ab|abc)` can generally be rewritten as non-overlapping ones with nested capture-groups, e.g. `(a(b(c)?)?)?`, where we ignore all but the outermost (i.e. leftmost) capture group when unpacking a match; admittedly this is slightly painful and less legible. This will also be a more performant regex to match. — smci, Nov 20 '17 at 02:30
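
One possible workaround for the same-start case raised above, assuming it is acceptable to scan the text once per alternative instead of using a single pattern. `matches_per_alternative` is a hypothetical helper, not part of `re`, and each alternative is treated as a regex fragment.

```python
import re

def matches_per_alternative(alternatives, text):
    """Return sorted (start, matched_text) pairs, scanning once per alternative."""
    found = []
    for alt in alternatives:
        # Wrapping each alternative in a zero-width lookahead lets its own
        # matches overlap; group(1) is always the outer capturing group.
        for m in re.finditer(r'(?=(%s))' % alt, text):
            found.append((m.start(), m.group(1)))
    return sorted(found)

print(matches_per_alternative(["a", "ab", "abc"], "abcd"))
# [(0, 'a'), (0, 'ab'), (0, 'abc')]
```

This still tracks positions for you, but at the cost of one pass over the text per alternative.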

score 249 · Accepted Answer · edited Feb 28 '23 at 14:14

Use a capturing group inside a lookahead. The lookahead captures the text you're interested in, but the actual match is the zero-width substring before the lookahead, so the matches themselves never overlap:

import re 
s = "123456789123456789"
matches = re.finditer(r'(?=(\d{10}))', s)
results = [int(match.group(1)) for match in matches]
# results: 
# [1234567891,
#  2345678912,
#  3456789123,
#  4567891234,
#  5678912345,
#  6789123456,
#  7891234567,
#  8912345678,
#  9123456789]

edited Feb 28 '23 at 14:14

Eric O. Lebigot

answered Apr 11 '11 at 04:58

mechanical_meat

3

My answer is at least 2 times faster than this one. But this solution is tricky, I upvote it. – eyquem Jul 05 '13 at 10:33
26

Explanation = instead of searching for the pattern (10 digits), it searches for anything FOLLOWED BY the pattern. So it finds position 0 of the string, position 1 of the string and so on. Then it grabs group(1) - the matching pattern and makes a list of those. VERY cool. – Tal Weiss Jul 18 '13 at 20:28
1

I had no idea you could use matching groups inside lookaheads, which normally aren't supposed to be included in a match (and the matched subgroups indeed do not appear in the full match). As this technique still seems to work in Python 3.4, I guess it's considered a feature rather than a bug. – JAB Mar 27 '14 at 18:35
19

I joined StackOverflow, answered questions, and got my reputation up just so I could upvote this answer. I'm stuck with Python 2.4 for now so I can't use the more advanced regex functions of Python 3, and this is just the sort of bizarre trickery I was looking for. – TheSoundDefense Jul 07 '14 at 17:17
2

Could you add more explanation to the code? Per Stack Overflow guidelines, it's not ideal to have only code in an answer. An explanation will definitely help people. – Akshay Hazari Sep 17 '17 at 08:12
Is there a particular reason to convert the matches to integers? Is it to generate the non-string output desired in the question? I need the same system, but not with integers. (In fact, I will be pulling the index of the match, not the data) – RufusVS Jan 24 '21 at 05:11
Just in case anyone only extracts strings, `re.findall(r'(?=(\d{10}))', s)` will do. – Wiktor Stribiżew May 19 '23 at 09:46
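
To make the zero-width behaviour discussed in these comments concrete, and to cover the case where only the match position is needed, here is a small sketch using the standard `re` module. The variable name `starts` is illustrative.

```python
import re

s = "123456789123456789"

for m in re.finditer(r'(?=(\d{10}))', s):
    # m.group(0) is the empty string: the lookahead itself consumes nothing.
    # m.start() is the index at which the captured 10-digit run begins.
    print(m.start(), repr(m.group(0)), m.group(1))

# Collect only the start indices, without converting anything to int.
starts = [m.start() for m in re.finditer(r'(?=(\d{10}))', s)]
print(starts)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```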

David C · Answer 2 · 2020-10-19T02:15:46.253

97

You can also try using the third-party regex module (not re), which supports overlapping matches.

>>> import regex as re
>>> s = "123456789123456789"
>>> matches = re.findall(r'\d{10}', s, overlapped=True)
>>> for match in matches: print(match)  # print match
...
1234567891
2345678912
3456789123
4567891234
5678912345
6789123456
7891234567
8912345678
9123456789

edited Oct 19 '20 at 02:15

answered Sep 23 '13 at 19:06

David C

I get: `TypeError: findall() got an unexpected keyword argument 'overlapped'` – Carsten Oct 17 '20 at 19:34
@Carsten: you first need to install the `regex` module: `pip install regex` – David C Oct 19 '20 at 01:38
2

That worked, thanks. I would have thought I'd get an import error if regex were not installed. – Carsten Oct 19 '20 at 06:55
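
The standard library's `re.findall` has no `overlapped` keyword, so the `TypeError` above is what you see when the name `re` refers to the built-in module rather than to `regex`. A hedged sketch of one way to guard against that: `find_overlapping` is a hypothetical helper, and the fallback assumes the pattern contains no capture groups of its own.

```python
import re

try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None

def find_overlapping(pattern, text):
    """Return all overlapping matches of a group-free pattern as strings."""
    if regex is not None:
        # The regex module accepts overlapped=True directly.
        return regex.findall(pattern, text, overlapped=True)
    # Fallback: capture inside a zero-width lookahead with the stdlib re.
    return re.findall(r'(?=(%s))' % pattern, text)

print(find_overlapping(r'\d{10}', "123456789123456789"))
```

Importing the package under its own name, rather than `import regex as re`, keeps it obvious which module a call goes to.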

score 16 · Answer 3 · answered Jul 27 '11 at 13:34

I'm fond of regexes, but they are not needed here.

Simply

s =  "123456789123456789"

n = 10
li = [ s[i:i+n] for i in xrange(len(s)-n+1) ]
print '\n'.join(li)

result

1234567891
2345678912
3456789123
4567891234
5678912345
6789123456
7891234567
8912345678
9123456789

answered Jul 27 '11 at 13:34

eyquem

12

Regexes are only not needed here because you're applying the special knowledge "within a larger series of numbers", so you already know every position `0 <= i < len(s)-n+1` is guaranteed to be the start of a 10-digit match. Also I figure your code could be sped up, would be interesting to code-golf for speed. – smci Nov 20 '17 at 02:34
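
Following up on that comment, here is a sketch of the slicing approach in Python 3 that drops the assumption that every position starts a match. It checks each window explicitly and assumes ASCII digits only; `digit_windows` is a hypothetical name.

```python
from string import digits

def digit_windows(s, n):
    """Return every length-n window of s that consists solely of ASCII digits."""
    allowed = set(digits)
    windows = []
    for i in range(len(s) - n + 1):
        chunk = s[i:i + n]
        # Keep the window only if all of its characters are digits.
        if all(ch in allowed for ch in chunk):
            windows.append(chunk)
    return windows

print(digit_windows("123456789123456789", 10))
print(digit_windows("12a45", 2))  # ['12', '45']
```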

score 3 · Answer 4 · answered Feb 03 '22 at 23:10

Piggybacking on the accepted answer, the following currently works as well

import re
s = "123456789123456789"
matches = re.findall(r'(?=(\d{10}))',s)
results = [int(match) for match in matches]

answered Feb 03 '22 at 23:10

Michael

score 0 · Answer 5 · answered Jun 01 '22 at 09:19

Conventional way:

import re


S = '123456789123456789'
result = []
while len(S):
    m = re.search(r'\d{10}', S)
    if m:
        result.append(int(m.group()))
        S = S[m.start() + 1:]
    else:
        break
print(result)
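
A variant of the same loop, sketched under the assumption that re-slicing the string on every iteration is undesirable. It uses the `pos` argument of a compiled pattern's `search` method to restart one character after each match. `overlapping_search` is a hypothetical helper name.

```python
import re

def overlapping_search(pattern, text):
    """Collect overlapping matches by restarting the search one character on."""
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(text):
        m = compiled.search(text, pos)
        if m is None:
            break
        results.append(m.group())
        # Resume just past the start of this match, not past its end.
        pos = m.start() + 1
    return results

print(overlapping_search(r'\d{10}', '123456789123456789'))
```

Unlike slicing, `pos` leaves the original string intact, so `m.start()` remains an index into the full text.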

answered Jun 01 '22 at 09:19

Avi Cohen
