finditer and findall jumping over substrings

Question

I am trying to find all occurrences of, e.g., p='gg' inside s='ggggg'. By my count there should be 4, since a match can start at any position except the last one. For example, s[1:3] is 'gg'. However, trying both:

>re.findall('gg','ggggg')
['gg','gg']
>list(re.finditer('gg','ggggg'))
[<_sre.SRE_Match object; span=(0, 2), match='gg'>,
 <_sre.SRE_Match object; span=(2, 4), match='gg'>]

It seems to skip over potential matches once it has found a match. As a result, searching for 'star' or 'start' is equivalent to just looking for 'star': I would never find the second, because the first is its prefix.

Is this a bug? How can I perform a full substring search?

example 2:

>re.findall('star|start','starting')
['star']
>list(re.finditer('star|start','starting'))
[<_sre.SRE_Match object; span=(0, 4), match='star'>]

(I am using Python 3, re version 2.2.1)

score 3 · Accepted Answer · answered Feb 15 '21 at 13:44

import re
re.findall('gg','ggggg')

results in 2 matches, because re.findall does not look for overlapping matches. As the re docs say:

Return all non-overlapping matches of pattern in string, as a list of strings.

So this is not a bug, but behavior that complies with the documentation.

If you are allowed to use external modules, you can use the third-party regex module in the following way:

import regex
# overlapped=True is a feature of the third-party regex module, not of the built-in re
print(regex.findall('gg', 'ggggg', overlapped=True))

output:

['gg', 'gg', 'gg', 'gg']
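
If an external module is not an option, a common workaround with the built-in re module is to wrap the pattern in a zero-width lookahead, so that the scan advances one character at a time instead of jumping past each match. The following is only a minimal sketch; the helper name overlapping_matches is made up for illustration and is not part of re:

```python
import re

def overlapping_matches(pattern, text):
    """Return (start, matched_text) for every position where pattern matches."""
    # (?=(...)) matches without consuming characters, so the next search
    # starts one character later rather than after the matched text.
    lookahead = re.compile(f"(?=({pattern}))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(text)]

print(overlapping_matches("gg", "ggggg"))
# [(0, 'gg'), (1, 'gg'), (2, 'gg'), (3, 'gg')]
```

Note that this still reports only one alternative per start position: with 'star|start' on 'starting' it yields just 'star' at index 0, because alternation tries the branches left to right and stops at the first one that succeeds.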

score 1 · Answer 2 · answered Feb 15 '21 at 13:44

The keyword you might be searching for is "overlapping". Here is a linked question: String count with overlapping occurrences.

From the re documentation:

findall: Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

It seems to be a feature, not a bug.

You can implement your own search function like this:

def my_search(s, substr):
    # Check every start index; yield those where substr begins.
    for i in range(len(s)):
        if s[i:].startswith(substr):
            yield i
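
The same idea can also be written with str.find, which avoids slicing the string at every index. This is a self-contained sketch, not a definitive implementation; the name find_all_overlapping is hypothetical, and yielding nothing for an empty substr is an assumption made here:

```python
def find_all_overlapping(s, substr):
    """Yield the start index of every occurrence of substr in s, overlaps included."""
    if not substr:
        return  # assumption: treat an empty pattern as having no occurrences
    i = s.find(substr)
    while i != -1:
        yield i
        # Resume one character after the last hit, not after the whole match.
        i = s.find(substr, i + 1)

print(list(find_all_overlapping("ggggg", "gg")))       # [0, 1, 2, 3]
print(list(find_all_overlapping("starting", "start"))) # [0]
```

Since it searches for one literal string at a time, finding both 'star' and 'start' means calling it once per string.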
