Regex to match all repeating alphanumerical subpatterns

Question

After searching for a while, I could only find how to match repetitions of a specific subpattern. Is there a way to find 3 or more repetitions of any subpattern?

For example:

re.findall(<the_regex>, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
→ ['a', 'b', 'x', 'aaabbbxxx_']

re.findall(<the_regex>, 'lalala luuluuluul')
→ ['la', 'luu', 'uul']

I apologize in advance if this is a duplicate and would be grateful to be redirected to the original question.

Would it be possible to get only the first match instead? I'm editing the question to reflect this case. — L. B., Jul 24 '20 at 18:11
Your final example suggests that you are looking for potentially overlapping patterns. This is significant enough that you should probably edit the question to clarify this. The examples help, but I really think some overall clarification would also be useful or even necessary. — tripleee, Jul 24 '20 at 18:16

anubhava · Accepted Answer · 2020-07-24T18:19:59.917

This lookahead-based regex may not return exactly what you show in the question, but it gets very close.

r'(?=(.+)\1\1)'


Code:

>>> import re
>>> reg = re.compile(r'(?=(.+)\1\1)')
>>> reg.findall('aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['aaabbbxxx_', 'b', 'x', 'a', 'b', 'x', 'a', 'b', 'x']
>>> reg.findall('lalala luuluuluul')
['la', 'luu', 'uul']

RegEx Details:

Because the whole regex is a lookahead, and a lookahead is a zero-width assertion, a match consumes no characters. The engine therefore tries every position in the input, which is what lets it report overlapping matches.

When the pattern contains a single capture group, findall returns the text of that group rather than the (here empty) overall match.

(?=: Start lookahead
- (.+): Match 1 or more of any character (greedy; . does not match a newline by default) and capture in group #1
- \1\1: Match two more occurrences of group #1 using the back-references \1\1
): End lookahead
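
If you also want to know where each repetition starts, or want each unit only once, something along these lines may help. It is only a sketch built on the same pattern; the variable names hits and unique are mine, not part of the original answer.

```python
import re

# Same pattern as above: a lookahead wrapping a capture group.
reg = re.compile(r'(?=(.+)\1\1)')

# Each overall match is empty (zero width); the repeated unit is in group 1.
hits = [(m.start(), m.group(1)) for m in reg.finditer('lalala luuluuluul')]
print(hits)    # [(0, 'la'), (7, 'luu'), (8, 'uul')]

# Drop duplicates while keeping first-seen order.
unique = list(dict.fromkeys(reg.findall('aaabbbxxx_aaabbbxxx_aaabbbxxx_')))
print(unique)  # ['aaabbbxxx_', 'b', 'x', 'a']
```

dict.fromkeys is used instead of set() only because it preserves the order in which the units were found.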

score 1 · Answer 2 · answered Jul 24 '20 at 18:16

re.findall() won't find overlapping matches. But you can find the non-overlapping matches using a capture group followed by a positive lookahead that matches a back-reference to that group.

>>> import re
>>> regex = r'(.+)(?=\1{2})'
>>> re.findall(regex, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['aaabbbxxx_', 'a', 'b', 'x', 'a', 'b', 'x']
>>> re.findall(regex, 'lalala luuluuluul')
['la', 'luu']
>>>

This will find the longest matches; if you change (.+) to (.+?) you'll get the shortest matches at each point.

>>> regex = r'(.+?)(?=\1{2})'
>>> re.findall(regex, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['a', 'b', 'x', 'a', 'b', 'x', 'a', 'b', 'x']
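
The difference between the two variants is easier to see on a run of one character. This is just an illustrative sketch; the input 'aaaaaa' and the names greedy and lazy are my own choices.

```python
import re

text = 'aaaaaa'

# Greedy: at each position, take the longest unit that is followed by two more copies.
greedy = re.findall(r'(.+)(?=\1{2})', text)
print(greedy)  # ['aa', 'a', 'a']

# Lazy: at each position, take the shortest such unit.
lazy = re.findall(r'(.+?)(?=\1{2})', text)
print(lazy)    # ['a', 'a', 'a', 'a']
```

In both cases the captured unit is consumed, so the next search starts right after it; only the lookahead part can be shared between matches.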

purpin · Answer 3 · 2020-07-24T18:25:11.277

It is not possible without defining the subpattern first.

Anyway, if the subpattern is just <any_alphanumeric>, then re.findall(<the_regex>, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_') would produce something like this:

['a', 'b', 'x', 'aa', 'ab', 'bb', 'bx', 'xx', 'x_', 'aaa', 'aaab', 'aaabb', ....]

i.e., every alphanumeric combination that is repeated three times, so a lot of combinations, not just ['a', 'b', 'x', 'aaabbbxxx_'].
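
To see what a given definition of "repeated" actually yields, a brute-force check without regex can be useful. The sketch below assumes the repetitions must be back to back (the reading the regex answers use), which is narrower than the enumeration above; the function name repeated_units and its times parameter are made up for illustration.

```python
def repeated_units(text, times=3):
    """Return every substring that occurs `times` times in a row somewhere in text."""
    found = set()
    n = len(text)
    for start in range(n):
        # A unit starting here can be at most this long and still fit `times` copies.
        max_len = (n - start) // times
        for length in range(1, max_len + 1):
            unit = text[start:start + length]
            if text.startswith(unit * times, start):
                found.add(unit)
    # Shortest units first, then alphabetical, for a stable result.
    return sorted(found, key=lambda u: (len(u), u))


print(repeated_units('aaabbbxxx_aaabbbxxx_aaabbbxxx_'))  # ['a', 'b', 'x', 'aaabbbxxx_']
print(repeated_units('lalala luuluuluul'))               # ['la', 'luu', 'uul']
```

This checks every start position and every candidate length, so it is slow on long inputs, but it makes the chosen definition explicit. Allowing non-adjacent repetitions would need a different check and would give a much longer list.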
