0

After searching for a while, I could only find how to match repetitions of a specific subpattern. Is there a way to find three or more consecutive repetitions of any subpattern?

For example:

re.findall(<the_regex>, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
→ ['a', 'b', 'x', 'aaabbbxxx_']

re.findall(<the_regex>, 'lalala luuluuluul')
→ ['la', 'luu', 'uul']

I apologize in advance if this is a duplicate and would be grateful to be redirected to the original question.

L. B.
  • Would it be possible to get only the first match instead? I'm editing the question to reflect this case. – L. B. Jul 24 '20 at 18:11
  • Your final example suggests that you are looking for potentially overlapping patterns. This is significant enough that you should probably [edit] the question to clarify this. The examples help, but I really think some overall clarification would also be useful or even necessary. – tripleee Jul 24 '20 at 18:16

3 Answers

2

This lookahead-based regex may not give exactly the output shown in the question, but it gets very close.

r'(?=(.+)\1\1)'

Code:

>>> import re
>>> reg = re.compile(r'(?=(.+)\1\1)')
>>> reg.findall('aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['aaabbbxxx_', 'b', 'x', 'a', 'b', 'x', 'a', 'b', 'x']
>>> reg.findall('lalala luuluuluul')
['la', 'luu', 'uul']

RegEx Details:

Because the whole regex is a lookahead, it consumes no characters: a lookahead is a zero-width match. This lets findall return overlapping matches from the input.

With findall, only the contents of the capture group are returned.

  • (?=: Start lookahead
    • (.+): Match 1 or more of any character (greedy) and capture in group #1
    • \1\1: Match two more occurrences of group #1 using the back-reference \1 twice
  • ): End lookahead
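The findall output above repeats a unit once for every position where it starts a triple. As a minimal sketch of how to get closer to the list in the question, and of the "first match only" case raised in the comments, the results can be deduplicated or the search stopped at the first hit. The helper names unique_units and first_unit are made up for this illustration; the pattern is the one from this answer.

```python
import re

# Same lookahead pattern as above: a unit followed by two more copies of itself.
TRIPLE = re.compile(r'(?=(.+)\1\1)')

def unique_units(text):
    """Return each repeated unit once, in order of first appearance."""
    # dict.fromkeys keeps insertion order (Python 3.7+) and drops duplicates.
    return list(dict.fromkeys(TRIPLE.findall(text)))

def first_unit(text):
    """Return the first repeated unit, or None if nothing repeats three times."""
    match = TRIPLE.search(text)
    return match.group(1) if match else None

print(unique_units('aaabbbxxx_aaabbbxxx_aaabbbxxx_'))  # ['aaabbbxxx_', 'b', 'x', 'a']
print(unique_units('lalala luuluuluul'))               # ['la', 'luu', 'uul']
print(first_unit('lalala luuluuluul'))                 # la
```

The order differs from the question's example because units are listed by where they first occur in the input, and the greedy group finds the longest unit at position 0 first.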
anubhava
1

re.findall() won't find overlapping matches. But you can find the non-overlapping matches using a capture group followed by a positive lookahead that matches a back-reference to that group.

>>> import re
>>> regex = r'(.+)(?=\1{2})'
>>> re.findall(regex, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['aaabbbxxx_', 'a', 'b', 'x', 'a', 'b', 'x']
>>> re.findall(regex, 'lalala luuluuluul')
['la', 'luu']

This will find the longest matches; if you change (.+) to (.+?) you'll get the shortest matches at each point.

>>> regex = r'(.+?)(?=\1{2})'
>>> re.findall(regex, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['a', 'b', 'x', 'a', 'b', 'x', 'a', 'b', 'x']
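To see where each non-overlapping match lands, the same two patterns can be run through re.finditer, which exposes the start offset of every match. This is only a sketch: the function name repeated_units and its lazy flag are invented here for illustration.

```python
import re

def repeated_units(text, lazy=False):
    """Return (start, unit) pairs for non-overlapping units repeated 3+ times."""
    # Greedy (.+) prefers the longest unit at each position, lazy (.+?) the shortest.
    group = r'(.+?)' if lazy else r'(.+)'
    pattern = re.compile(group + r'(?=\1{2})')
    return [(m.start(1), m.group(1)) for m in pattern.finditer(text)]

sample = 'aaabbbxxx_aaabbbxxx_aaabbbxxx_'
print(repeated_units(sample))
# [(0, 'aaabbbxxx_'), (10, 'a'), (13, 'b'), (16, 'x'), (20, 'a'), (23, 'b'), (26, 'x')]
print(repeated_units(sample, lazy=True)[:3])
# [(0, 'a'), (3, 'b'), (6, 'x')]
```

The offsets show why the greedy version skips the first 'a', 'b' and 'x': the 10-character unit matched at offset 0 consumes them, so the next search starts at offset 10.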
Barmar
1

It is not possible without defining the subpattern first.

Anyway, if the subpattern is just <any_alphanumeric>, then re.findall(<the_regex>, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_') would produce something like this:

['a', 'b', 'x', 'aa', 'ab', 'bb', 'bx', 'xx', 'x_', 'aaa', 'aaab', 'aaabb', ....]

i.e., every alphanumeric combination that is repeated three times, so a lot of combinations, not just ['a', 'b', 'x', 'aaabbbxxx_']

purpin