Match all elements with n occurrences

Question

I want to match runs of the same element that occur exactly n times.

For example, match the letters that repeat exactly 3 times in this string: "aaaaabbbcccccccccdddee"

This should return "bbb" and "ddd".

If I specify the character up front, as in "b{3}" or "d{3}", this is easier, but I want to match any element.

I've tried, and the closest I've come is this regex: (.)\1{2}(?!\1), which returns "aaa", "bbb", "ccc", "ddd".

And I can't add a negative lookbehind such as (?<!\1), because the engine rejects it as "non-fixed width".
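The over-matching described above can be reproduced with a short sketch. It assumes Python's re module purely for illustration (the question does not name a language), and the variable names are made up for this example.

```python
import re

text = "aaaaabbbcccccccccdddee"

# The attempt from the question: three identical characters, with a negative
# lookahead that forbids a fourth identical character after them.
attempt = re.compile(r"(.)\1{2}(?!\1)")

# Nothing forbids the same character *before* the three, so the last three
# characters of a longer run still qualify.
attempt_matches = [m.group(0) for m in attempt.finditer(text)]
print(attempt_matches)  # ['aaa', 'bbb', 'ccc', 'ddd']
```

The unwanted "aaa" and "ccc" are the tails of the runs of five a's and nine c's.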

Please add a tag that identifies the language you are using (as different languages support different regular expression features and formats). – Cary Swoveland, Mar 29 '22 at 04:22

Nick · Answer 1 · 2022-03-29T06:03:06.903

4

One possibility is a regex that looks for a character that is not followed by itself (or for the beginning of the line), then three identical characters, and then asserts that the next character is not the same as those three, i.e.

(?:(.)(?!\1)|^)((.)\3{2})(?!\3)

Demo on regex101

The match is captured in group 2. The issue with this though is that it absorbs a character prior to the match, so cannot find adjacent matches: as shown in the demo, it only matches aaa, ccc and eee in aaabbbcccdddeee.

This issue can be resolved by making the entire regex a lookahead, a technique which allows for capturing overlapping matches as described in this question. So:

(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))

Again, the match is captured in group 2.

Demo on regex101
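As a rough check of both patterns, the sketch below runs them with Python's re module (an assumption; any engine with backreferences and lookahead should behave similarly). The helper name group2_matches is hypothetical and exists only for this example.

```python
import re

# Consuming version: the character before the run becomes part of the match.
consuming = re.compile(r"(?:(.)(?!\1)|^)((.)\3{2})(?!\3)")

# Same pattern wrapped in a lookahead: every match is zero-width, so the
# character before one run can also be the last character of the previous run.
lookahead = re.compile(r"(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))")


def group2_matches(pattern, text):
    """Collect capture group 2, which holds the run of three characters."""
    return [m.group(2) for m in pattern.finditer(text)]


adjacent = "aaabbbcccdddeee"
print(group2_matches(consuming, adjacent))  # ['aaa', 'ccc', 'eee']
print(group2_matches(lookahead, adjacent))  # ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
print(group2_matches(lookahead, "aaaaabbbcccccccccdddee"))  # ['bbb', 'ddd']
```

Because the lookahead version consumes nothing, the result has to be read from group 2; the overall match is always an empty string.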

edited Mar 29 '22 at 06:03

answered Mar 29 '22 at 05:31

Nick


Just get it to group 2, why didn't I think of this. – Luka Mar 29 '22 at 06:43

@Luka sometimes you just need a fresh set of eyes looking at the problem – Nick Mar 29 '22 at 06:55

Chris Maurer · Answer 2 · 2022-04-09T05:48:13.620

This gets sticky because you cannot put a backreference inside a negated character class, so we'll use a lookbehind followed by a negative lookahead like this:

(?<=(.))((?!\1).)\2\2(?!\2)

This says: look behind to capture the previous character in group 1 without including it in the match. Then look ahead to be certain the next character is different from it. Next, consume that character into capture group 2, require that the next two characters match it, and require that the one after does not.

Unfortunately, this does not work on 3 characters at the beginning of the string. I had to add a whole alternation clause to handle that case. So the final regex is:

(?:(?<=(.))((?!\1).)\2\2(?!\2))|^(.)\3\3(?!\3)

This handles all cases.

EDIT

I found a way to handle matches at the beginning of the string:

(?:(?<=(.))|^)((?!\1).)\2\2(?!\2)

Much nicer and more compact, and does not require looking in capture groups to get the answer.
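A quick way to check the final pattern is to run it with Python's re module (an assumption; the question never names a language). The sketch relies on two behaviours of that engine: a one-character lookbehind may contain a capture group, and a backreference to a group that did not participate fails to match, which lets (?!\1) succeed at the start of the string. Engines that treat an unset backreference as an empty match, such as JavaScript's, would likely need the longer alternation shown earlier.

```python
import re

# (?<=(.))  capture the previous character without consuming it
# |^        ...or accept the start of the string (group 1 stays unset)
# ((?!\1).) consume one character that differs from the previous one
# \2\2      the same character twice more
# (?!\2)    and no fourth occurrence
exact_three = re.compile(r"(?:(?<=(.))|^)((?!\1).)\2\2(?!\2)")

# The whole match is the run itself, so group(0) is enough.
question_runs = [m.group(0) for m in exact_three.finditer("aaaaabbbcccccccccdddee")]
adjacent_runs = [m.group(0) for m in exact_three.finditer("aaabbbcccdddeee")]

print(question_runs)  # ['bbb', 'ddd']
print(adjacent_runs)  # ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
```

The second input covers both a run at the very start of the string and runs that sit directly next to each other.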

The fourth bird · Accepted Answer · 2022-03-29T08:40:38.017

You could match what you don't want to keep, which is the same character repeated 4 or more times.

Then use an alternation to capture what you want to keep, which is the same character repeated exactly 3 times.

The desired matches are in capture group 2.

(.)\1{3,}|((.)\3\3)

(.) Capture group 1, match a single character
\1{3,} Repeat the same character as in group 1, 3 or more times (4 or more in total)
| Or
( Capture group 2
(.)\3\3 Capture group 3, match a single character, followed by two backreferences that match the same character as in group 3 twice more
) Close group 2

Regex demo
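The same idea can be sketched for an arbitrary n. This is only an illustration in Python (an assumed language), and runs_of_exactly is a hypothetical helper name, not something from the answer. Matches of the first alternative leave group 2 unset, so they are filtered out.

```python
import re


def runs_of_exactly(text, n):
    """Return the runs in `text` where one character repeats exactly n times."""
    # First alternative: n + 1 or more identical characters (consumed, then dropped).
    # Second alternative: exactly n identical characters, captured in group 2.
    pattern = re.compile(rf"(.)\1{{{n},}}|((.)\3{{{n - 1}}})")
    return [m.group(2) for m in pattern.finditer(text) if m.group(2) is not None]


print(runs_of_exactly("aaaaabbbcccccccccdddee", 3))  # ['bbb', 'ddd']
print(runs_of_exactly("aaaaabbbcccccccccdddee", 2))  # ['ee']
```

The quantifier on the first alternative is greedy, so a long run is always swallowed whole before the second alternative gets a chance to match inside it. Note that . does not match a newline unless a flag such as re.DOTALL is added.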

score 1 · Answer 4 · answered Apr 01 '22 at 04:49

If your environment permits the use of (*SKIP)(*FAIL), you can return a lean set of matches by consuming substrings of four or more consecutive duplicate characters and then discarding them. In the other branch of the alternation, match the desired 3 consecutive duplicate characters.

PHP Code: (Demo)

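$string = 'aaaaabbbcccccccccdddee';
var_export(
    preg_match_all(
        // Branch 1: consume a run of 4+ identical characters, then discard it with (*SKIP)(*F).
        // Branch 2: match 3 identical characters; longer runs were already skipped.
        '/(?:(.)\1{3,}(*SKIP)(*F)|(.)\2{2})/',
        $string,
        $m
    )
    ? $m[0] // full matches only, no capture groups to filter
    : 'no matches'
);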

Output:

array (
  0 => 'bbb',
  1 => 'ddd',
)

This technique uses no lookarounds and does not generate false positive matches in the matches array (which would otherwise need to be filtered out).

This pattern is efficient because it never needs to look backward, and by consuming runs of 4 or more consecutive duplicates it can rule out long substrings quickly.
