
Using regex only, how can I match an exact number of consecutive repetitions of an arbitrary single token? For example, with a desired repetition count of 3, I want to match the "aaa" in "ttaaabbb" but not the "aaaa" in "ttaaaabbb".

Clarification: "a" is only an example; the token can be any character, digit, or symbol. That is, with a desired repetition count of 3, the string "aaaa**!!!cccc333**" should yield only "!!!" and "333".

In short, I want to find the list of tokens "X" for which YXXXY appears in the given string (Y is any token different from X; Y can also be the start or the end of the string). Note that the list can contain repeated tokens, e.g., "aaabbbbaaa" should give ["a", "a"].

Some other examples:

Input: "aaabbbbbbaaa****ccc", output: ["a", "a", "c"]

Input: "!!! aaaabbbaaa ccc!!!", output: ["!", "b", "a", "c", "!"]

What I have tried: I tried (.)\1{2}, but unfortunately it also matches inside longer runs such as "aaaa" and "ccccc" in the example above. I then changed it to (?!\1)(.)\1{2}(?!\1) so that the characters before and after the repeated run must differ from it. However, this raises an error, because the first \1 refers to a group that has not been defined yet.
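A minimal reproduction of both attempts, assuming Python's `re` module (the question does not name a regex flavor, so that choice is an assumption):

```python
import re

s = "aaaa**!!!cccc333**"

# First attempt: three characters taken from a longer run also match.
first = re.findall(r"(.)\1{2}", s)
print(first)

# Second attempt: \1 is used before group 1 has been defined,
# so the pattern is rejected when it is compiled.
try:
    re.compile(r"(?!\1)(.)\1{2}(?!\1)")
    error = None
except re.error as exc:
    error = exc
print(error)
```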

puyuan
  • The previous description was ambiguous, I added more description. – puyuan Dec 01 '21 at 01:17
  • Do you have to use only regex ? – Chiheb Nexus Dec 01 '21 at 01:39
  • yeah...otherwise it can be easily solved by loops. – puyuan Dec 01 '21 at 01:41
  • Is it a requirement that the match must be the whole match (instead of a group)? – user202729 Dec 01 '21 at 02:12
  • I am not sure what do you mean by the whole match... but I need a list of tokens that satisfy the pattern I described. For example, given the input "aaabbbbaaaccc", the output should be something like ["a", "a", "c"]. – puyuan Dec 01 '21 at 02:22
  • When you address comments asking for clarification you should edit the question rather than elaborate in comments. Questions should be self-contained, in part because not all readers read all comments. In any event, your question is still not clear (to me, anyway). Do you wish to determine whether *every* unique character in a string only appears in consecutive groups of (say) 3 characters? Alternatively, do you wish to determine whether a *specified* character appears at least once in the string and wherever it appears, it appears in consecutive groups of (say) 3? – Cary Swoveland Dec 01 '21 at 03:18
  • I don't believe you can use a regular expression directly for this. Suppose the string were `"aaaabbbcccdddbbbccdddee"`. Matches against the regular expression `(.)\1*` produce an array ("list" in Python?) `['aaaa', 'bbb', 'ccc', 'ddd', 'bbb', 'cc', 'ddd', 'ee']`. In Python you should be able to easily create the hash `"{"a"=>["aaaa"], "b"=>["bbb", "bbb"], "c"=>["ccc", "cc"], "d"=>["ddd", "ddd"], "e"=>["ee"]}`. You then want to keep those keys whose values contain three-character strings only, namely, `['b', 'd']`. – Cary Swoveland Dec 01 '21 at 07:14

2 Answers


You might use a pattern with 2 capture groups and a repeated backreference.

The first alternative consumes any run of 4 or more of the same character, which is what you want to skip; the second alternative then matches a run of exactly 3 of the same character.

The single characters that you want are in capture group 2, which you can get using re.finditer for example.

(\S)\1{3,}|(\S)\2{2}

The pattern matches:

  • (\S)\1{3,} Capture group 1, match a non-whitespace char and repeat the backreference 3 or more times (a run of 4 or more)
  • | Or
  • (\S)\2{2} Capture group 2, match a non-whitespace char and repeat the backreference 2 times (a run of exactly 3)


For example:

import re

strings = [
    "aaaa**!!!cccc333**",
    "aaabbbbaaa",
    "aaabbbbbbaaa****ccc",
    "!!! aaaabbbaaa ccc!!!"
]
pattern = r"(\S)\1{3,}|(\S)\2{2}"
for s in strings:
    result = []
    for match in re.finditer(pattern, s):
        # Group 2 is set only when the run is exactly 3 characters long.
        if match.group(2):
            result.append(match.group(2))
    print(result)

Output

['!', '3']
['a', 'a']
['a', 'a', 'c']
['!', 'b', 'a', 'c', '!']
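The same idea can be sketched for an arbitrary repetition count by building the pattern from `n`. The function name `tokens_with_exact_run` and the `n` parameter are hypothetical and not part of the answer above; like the original pattern, this sketch ignores runs of whitespace.

```python
import re

def tokens_with_exact_run(text, n):
    """Return each non-whitespace character that occurs in a run of exactly n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Group 1 swallows runs longer than n; group 2 captures runs of exactly n.
    pattern = re.compile(r"(\S)\1{%d,}|(\S)\2{%d}" % (n, n - 1))
    return [m.group(2) for m in pattern.finditer(text) if m.group(2)]

print(tokens_with_exact_run("aaabbbbbbaaa****ccc", 3))
print(tokens_with_exact_run("aabbbcc", 2))
```

With `n = 3` the generated pattern is the one shown above, `(\S)\1{3,}|(\S)\2{2}`.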
The fourth bird

You can do something like this using a regex and a loop:

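import re

def exact_re_match(string, length):
    # Match each maximal run of one repeated character.
    regex = re.compile(r'(.)\1*')
    for match in regex.finditer(string):
        elm = match.group()
        # Keep only the runs whose length is exactly the requested one.
        if len(elm) == length:
            yield elm

string = "aaaa!!!cccc333"
out = list(exact_re_match(string, 3))
print(out)
# ['!!!', '333']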
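If the length check has to live in the pattern itself rather than in the loop, one possible variant puts a negative lookbehind after the capture group, so the backreference is already defined when it is used. This is a sketch, not part of the answer above: the name `exact_run_tokens` is hypothetical, and it assumes Python 3.5 or later, where a lookbehind may contain a reference to a fixed-width group.

```python
import re

def exact_run_tokens(text, n):
    """Return the character of every run that is exactly n characters long."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # (.)          capture the first character of a candidate run
    # (?<!\1.)     the character before it must not be the same one
    # \1{n-1}      the run continues for n - 1 more characters
    # (?!\1)       and it must stop there
    pattern = re.compile(r"(.)(?<!\1.)\1{%d}(?!\1)" % (n - 1), re.DOTALL)
    return [m.group(1) for m in pattern.finditer(text)]

print(exact_run_tokens("aaaa!!!cccc333", 3))
print(exact_run_tokens("!!! aaaabbbaaa ccc!!!", 3))
```

Unlike a pattern built on `\S`, this variant also reports runs of whitespace, because `.` with `re.DOTALL` matches any character.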
Chiheb Nexus