4

I need a regular expression that will match groups of characters in a string. Here's an example string:

qwwwwwwwwweeeeerrtyyyyyqqqqwEErTTT

It should match

(match group) "result"

(1) "q"

(2) "wwwwwwwww"

(3) "eeeee"

(4) "rr"

(5) "t"

(6) "yyyyy"

(7) "qqqq"

(8) "w"

(9) "EE"

(10) "r"

(11) "TTT"

After doing some research, this is the best I could come up with:

/(.)(\1*)/g

The problem I'm having is that the only way to use the \1 back-reference is to capture the character first. If I could reference the result of a non-capturing group, I could solve this problem, but after researching I don't think that's possible.

user2936448

4 Answers

4

How about /((.)(\2*))/g? That way, you match the group as a whole (I'm assuming that that's what you want, and that's what's lacking from the solution you found).
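As a rough sketch of the same idea in Java: the whole match (group 0) already spans the full run, so the pattern from the question, (.)\1*, yields the desired strings if you read the match itself rather than its sub-groups. The outer parentheses only matter when you specifically need the run as a numbered group. The class and method names below are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class WholeMatchRuns {
    // Group 1 captures a single character; \1* consumes its immediate repeats.
    // Note: '.' does not match line terminators unless Pattern.DOTALL is set.
    private static final Pattern RUN = Pattern.compile("(.)\\1*");

    /** Returns each run of identical consecutive characters, in order. */
    static List<String> runs(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = RUN.matcher(input);
        while (m.find()) {
            result.add(m.group()); // group 0: the entire run
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(runs("qwwwwwwwwweeeeerrtyyyyyqqqqwEErTTT"));
    }
}
```

Matching is case-sensitive by default, so "w" and "EE" come out as separate runs rather than being merged with neighbouring lower-case letters.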

SQB
3

Looks like you need to use a Matcher in a loop:

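import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 2 captures one character; \\2* extends the match over its repeats.
Pattern p = Pattern.compile("((.)\\2*)");
Matcher m = p.matcher("qwwwwwwwwweeeeerrtyyyyyqqqqwEErTTT");
while (m.find()) {
    System.out.println(m.group(1)); // group 1 is the whole run
}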

Outputs:

q
wwwwwwwww
eeeee
rr
t
yyyyy
qqqq
w
EE
r
TTT
willkil
1

Assuming what @cruncher said as a premise is true: "we want to catch repeating letter groups without knowing beforehand which letter should be repeating" then:

/((a*?+)|(b*?+)|(c*?+)|(d*?+)|(e*?+)|(f*?+)|(g*?+)|(h*?+))/

The above RegEx should allow the capture of repeating letter groups without hardcoding a particular order in which they would occur.

The ?+ is a reluctant possessive quantifier, which helps avoid wasting memory by not saving previously valid backtracking states once the current case is valid.

Mihai Stancu
  • The problem with this is that the regex grows with the size of your input domain, which for just letters is 26*2=52. Actually it's worse than this: I just realised your regex forces a specific order. – Cruncher Oct 30 '13 at 13:23
  • The last one does not force a specific order because of the `|` logical `OR` operator. – Mihai Stancu Oct 30 '13 at 13:32
  • Reluctant possessive quantifiers? That's a new one on me! A quantifier can be either reluctant or possessive, not both. In this case it's the possessive kind you want (`a*+`, `b*+`, etc.). – Alan Moore Oct 30 '13 at 14:24
  • Isn't `\w*` greedy, `\w*?` reluctant, `\w*+` possessive and `\w*?+` reluctant possessive, as in it is possessive of every step it matches but it doesn't match all repetitions from the first step, it waits for the following groups to finish their own matching? – Mihai Stancu Oct 30 '13 at 14:29
  • Doesn't the `?` in this `\w*?+` stand for reluctance? Or does it stand for possible (may or may not be) meaning that `\w*+` is equivalent to `\w*?+` because `*` means zero or more repetitions and `?` means possible. – Mihai Stancu Oct 30 '13 at 14:32
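To make the distinction raised in the comments concrete, here is a small Java sketch contrasting the three quantifier kinds (greedy, reluctant, possessive) on the same input. The class and method names are hypothetical; only `Pattern.matches` is a real library call.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class QuantifierKinds {
    /** True if the regex matches the entire input. */
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String input = "aaa";
        // Greedy: a* takes all three 'a's, then gives one back for the trailing 'a'.
        System.out.println(fullMatch("a*a", input));  // true
        // Reluctant: a*? starts with zero 'a's and grows only as far as needed.
        System.out.println(fullMatch("a*?a", input)); // true
        // Possessive: a*+ takes all three 'a's and never gives any back,
        // so nothing is left for the trailing 'a'.
        System.out.println(fullMatch("a*+a", input)); // false
    }
}
```

The suffix after `*` selects exactly one behaviour: none for greedy, `?` for reluctant, `+` for possessive.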
0

Since you did tag java, I'll give an alternative non-regex solution (I believe in requirements being the end product, not the method by which you get there).

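// s is the input string; doSomething stands for whatever consumes each run
String repeat = "";
char c = '\0'; // sentinel value before any character has been read
for (int i = 0; i < s.length(); i++) {
    if (s.charAt(i) == c) {
        repeat += c; // same character: extend the current run
    } else {
        if (!repeat.isEmpty())
            doSomething(repeat); // add to an array if you want
        c = s.charAt(i);
        repeat = "" + c; // start a new run
    }
}
// flush the final run
if (!repeat.isEmpty())
    doSomething(repeat);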
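For anyone who wants something they can paste and run, here is one possible self-contained variant of the loop above. It is a sketch, not the only way to do it: it tracks the start index of the current run instead of building the string character by character, and it collects the runs into a list. The names `RunSplitter` and `splitRuns` are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
class RunSplitter {
    /** Splits s into runs of identical consecutive characters, without regex. */
    static List<String> splitRuns(String s) {
        List<String> runs = new ArrayList<>();
        int start = 0; // index where the current run begins
        for (int i = 1; i <= s.length(); i++) {
            // A run ends at the end of the string or where the character changes.
            if (i == s.length() || s.charAt(i) != s.charAt(start)) {
                runs.add(s.substring(start, i));
                start = i;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(splitRuns("qwwwwwwwwweeeeerrtyyyyyqqqqwEErTTT"));
    }
}
```

An empty input never enters the loop body, so it yields an empty list.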
Cruncher