Regex sub-substring appear k times within a larger substring of length n

Question

Suppose you have a string that consists of the concatenation of n string tokens each of length d. What is the regex pattern that would match this string if and only if n==N+2 (where N is a fixed) and a particular token appears between k and N times in it.

Problem

Suppose we have (tokens of length d=4):

s = "SbbESbbESbbEStbEStbESbbEStbESttE".

We would like to match substrings where SbtE appears exactly 2, 3, or 4 times (k=2) between a delimiting SbbE token and a delimiting token that's not SbtE, the delimiting tokens themselves 4 tokens apart (N=4).

Desired solution

Adding spaces inside s for readability s ~ "SbbE SbbE SbbE StbE StbE SbbE StbE SttE". Here, we have a total of 8 tokens, but we are looking for substrings containing exactly 4 tokens, which in turn contain between 2 and 4 appearances of SbtE. So, the first match should be from token 1 to token 6, and the second match should be from token 3 to token 8.

Is this even possible using regular expressions?

Nick · Answer 1 · 2022-07-07T02:50:46.003

Although this is doable with regex, it would probably be simpler to just split the string into tokens and process them. For example:

d, k, N = 4, 2, 4
s = "SbbESbbESbbEStbEStbESbbEStbESttE"
tokens = [s[i:i+d] for i in range(0, len(s), d)]
matches = []
for i in range(len(tokens)-N-2+1):
    if (tokens[i] == 'SbbE' and 
        tokens[i+1:i+N+1].count('StbE') >= k and 
        tokens[i+N+1] != 'StbE'):
        matches.append(''.join(tokens[i:i+N+2]))

print(matches)

Output:

['SbbESbbESbbEStbEStbESbbE', 'SbbEStbEStbESbbEStbESttE']

If you are committed to using regex, you can use the following regex to match your requirements (for the specific d, k, N = 4, 2, 4 case):

SbbE(?=(?:.{4}){0,2}(?:StbE){2}|(?:.{4})?StbE.{4}StbE|StbE.{8}StbE).{16}(?!StbE).{4}

This looks for:

Sbbe : the string Sbbe
(?= : a positive lookahead assertion for 3 alternatives:
(?:.{4}){0,2}(?:StbE){2} : StbEStbE preceded by 0, 4, or 8 characters; or
(?:.{4})?StbE.{4}StbE : StbE....StbE preceded by 0 or 4 characters; or
StbE.{8}StbE : StbE........StbE
) : end of assertion
.{16} : match 16 characters
(?!StbE) : a negative lookahead that says the next token is not StbE
.{4} : 4 characters

Note that since you want 2 to 4 tokens in a substring of 4 tokens, you only need to check that at least 2 are present (hence the 3 alternatives in the positive lookahead).

Now since you want to find overlapping matches, we wrap the entire regex in a positive lookahead (as described in this answer):

(?=(SbbE(?=(?:.{4}){0,2}(?:StbE){2}|(?:.{4})?StbE.{4}StbE|StbE.{8}StbE).{16}(?!StbE).{4}))

Demo on regex101

In python:

s = "SbbESbbESbbEStbEStbESbbEStbESttE"
re.findall(r'(?=(SbbE(?=(?:.{4}){0,2}(?:StbE){2}|(?:.{4})?StbE.{4}StbE|StbE.{8}StbE).{16}(?!StbE).{4}))', s)

Output:

['SbbESbbESbbEStbEStbESbbE', 'SbbEStbEStbESbbEStbESttE']

Thank you @Nick. This beautifully solves the specific example I gave, but I was looking for a generic way to match `k` letters/tokens within a substring of length `n` (where `k` and `n` are not fixed; i.e., a pattern that doesn't need to be tailored to each `(k,n)` pair). I imagine that's not possible with regex? Anyway, the `for` loop is a practical workaround. — Sia, Jul 08 '22 at 18:19
@Sia unfortunately the issue with trying to do a generic regex is that you can't stop the lookaheads for `StbE` looking beyond the end of the match, thus giving you false matches e.g. https://regex101.com/r/oPXCED/1. The only way to avoid that would be to iterate over the string in chunks of d*(N+2) characters first, in which case you might as well use my non-regex solution as it will be much faster. — Nick, Jul 08 '22 at 22:29

Regex sub-substring appear k times within a larger substring of length n

Problem

Desired solution

1 Answers1