Find all occurrences of a substring (including overlap)?

Question

Okay, so I found this: How to find all occurrences of a substring?

Which says, to get the indices overlapping occurances of substrings in a list, you can use:

[m.start() for m in re.finditer('(?=SUBSTRING)', 'STRING')]

Which works, but my problem is that both the string and the substring to look for are defined by variables. I don't know enough about regular expressions to know how to deal with it - I can get it to work with non-overlapping substrings, that's just:

[m.start() for m in re.finditer(p3, p1)]

Edit:

Because someone asked, I'll go ahead and specfify. p1 and p3 could be any string, but if they were, for example p3 = "tryt" and p1 = "trytryt", the result should be [0, 3].

Please post `p3`, `p1` and expected output. – Ashwini Chaudhary Sep 21 '13 at 14:24 — Ashwini Chaudhary, Sep 21 '13 at 14:24

score 7 · Accepted Answer · edited Sep 21 '13 at 15:15

7

The arguments to re.finditer are simple strings. If you have the substring in a variable simply format it into the regular expression. Something like '(?={0})'.format(p3) is a start. Since various symbols do have special meaning in a RE you will want to escape them. Luckily the re module includes re.escape for just such a need.

[m.start() for m in re.finditer('(?={0})'.format(re.escape(p3)), p1)]

edited Sep 21 '13 at 15:15

answered Sep 21 '13 at 14:29

D.Shawley

58,213
10
98
113

Didn't work, but I changed '(=?{0})' back to '(?={0})' and it worked perfectly. Thanks! – Kevin Sep 21 '13 at 14:56
sorry 'bout that but happy to be of assistance. Thanks @hop – D.Shawley Sep 21 '13 at 16:39

score 1 · Answer 2 · answered Sep 21 '13 at 16:27

1

Regex might be overkill here:

>>> word = 'tryt'
>>> text = 'trytryt'
>>> [i for i, _ in enumerate(text) if text.startswith(word, i)]
[0, 3]

answered Sep 21 '13 at 16:27

Eric

95,302
53
242
374

Veedrac · Answer 3 · 2017-09-10T14:24:30.083

0

You are doing this (or a syntactic variation):

import re

needle = "(?=(aba))"
haystack = "ababababa"

[match.start() for match in re.finditer(needle, haystack)]
#>>> [0, 2, 4, 6]

which should work.

The problem is therefore probably that needle isn't of the proper form, "(?=(...))", (this is evident from your interaction with D.Shawley). In that case, there are several options.

If your substring is a valid regex, you can iterate through the possible possitions manually, doing a match.

needle = re.compile(needle)
[i for i in range(len(haystack)) if needle.match(haystack, i)]
#>>> [0, 2, 4, 6]

If you're not wanting arbitrary Regexes but just exact substring matching, it'll be cleaner to avoid Regex entirely and go with:

needle = "aba"
haystack = "ababababa"

[i for i in range(len(haystack)) if haystack.startswith(needle, i)]
#>>> [0, 2, 4, 6]

If you're looking for a faster result, you can expand the loop and use .index to speed up the search:

def findall(needle, haystack):
    i = 0
    try:
        while True:
            i = haystack.index(needle, i)
            yield i
            i += 1

    except ValueError:
        pass

which is the fastest method I can think of.

edited Sep 10 '17 at 14:24

answered Sep 21 '13 at 14:39

Veedrac

58,273
15
112
169

completely missing the point of the question. also, very confusing. – Sep 21 '13 at 15:20
How so? The last snippet is exactly what was asked for; the first two work on the assumption that the "substring" can be defined as any regex as the question's code implied. – Veedrac Sep 21 '13 at 15:38
first you rewrite the OPs code for no apparent reason. then: the question actually asks about how to construct one string from another, which you don't address. also, you write "which should work" implying that it doesn't, but don't explain in what circumstances it doesn't. i certainly works with the regex as given, and why should that ever be changed? then you go off on a tangent about O(), which is completely irrelevant to the topic at hand. and speaking of complexity: the regex version completely blows the strcmp version out of the water for longer haystacks - just `timeit`. had enough? – Sep 21 '13 at 15:57
"first you rewrite the OPs code for no apparent reason". If that matters I can make it more similar to the original. // "the question actually asks about how to construct one string from another". Where? // "you write "which should work" implying that it doesn't, but don't explain in what circumstances it doesn't". I did, I'll clarify in an edit. // "then you go [on about] O()". That *was* relevant, though, even if D.Shawley's solution didn't need to cover it // "the regex version [is faster]". Fair enough. – Veedrac Sep 21 '13 at 16:14
+1 for the last one, which is _much_ faster than using `enumerate` or `finditer` due to being implemented in C. (Although it can be written a bit shorter, without try/except using `find`) – tobias_k Sep 10 '17 at 10:24

Find all occurrences of a substring (including overlap)?

3 Answers3

Related