
Is there a way to reuse a consumed character of the source in pattern matching?

For example, suppose I want to find a pattern with the regex (a+b+|b+a+), i.e. one or more a's followed by one or more b's, or vice versa.

Suppose the input is aaaabbbaaaaab

Then the output using regex would be aaaabbb and aaaaab

How can I get the output to be

aaaabbb
bbbaaaaa
aaaaab
Roman C
dshgna

2 Answers


Try it this way:

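import java.util.regex.Matcher;
import java.util.regex.Pattern;

String data = "aaaabbbaaaaab";
// The lookahead captures the whole match in group 1 without consuming it.
Matcher m = Pattern.compile("(?=(a+b+|b+a+))(^|(?<=a)b|(?<=b)a)").matcher(data);
while (m.find())
    System.out.println(m.group(1));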

This regex uses lookaround and finds every (a+b+|b+a+) that

  • is at the start ^ of the input,
  • starts with a b that is preceded by a, or
  • starts with an a that is preceded by b.

Output:

aaaabbb
bbbaaaaa
aaaaab
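
For anyone who wants to run this as a complete program, here is a minimal sketch. The class name `OverlapDemo` and the helper `findOverlapping` are made up for illustration; only the regex comes from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {
    // The lookahead captures the full run pair in group 1 without consuming it;
    // the second group consumes at most one character at a run boundary.
    private static final Pattern RUNS =
            Pattern.compile("(?=(a+b+|b+a+))(^|(?<=a)b|(?<=b)a)");

    // Hypothetical helper: collects group 1 of every match, in order.
    static List<String> findOverlapping(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = RUNS.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        findOverlapping("aaaabbbaaaaab").forEach(System.out::println);
    }
}
```

Because the text of interest sits inside the lookahead, the characters of one result are still available to the next one, which is what allows the results to overlap.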

Is ^ essentially needed in this regular expression?

Yes. Without ^ this regex wouldn't capture the aaaabbb at the start of the input.

If I hadn't added (^|(?<=a)b|(?<=b)a) after (?=(a+b+|b+a+)), the lookahead would succeed at almost every position and the regex would match

aaaabbb
aaabbb
aabbb
abbb
bbbaaaaa
bbaaaaa
baaaaa
aaaaab
aaaab
aaab
aab
ab

so I needed to limit the results to those that start with an a that has a b before it (without including that b in the match, so lookbehind was perfect for this) or with a b that is preceded by an a.

But let's not forget an a or b at the start of the string, which is not preceded by anything. To include those we can use ^.
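
To see what the boundary condition does, here is a small sketch that runs the bare lookahead and the restricted regex on the same input. The class `BoundaryFilterDemo` and the helper `groups` are hypothetical names; the two patterns are the ones discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryFilterDemo {
    // Lookahead only: succeeds at every position where a run pair can start.
    static final Pattern UNRESTRICTED = Pattern.compile("(?=(a+b+|b+a+))");

    // Lookahead plus the condition that the position is the input start
    // or the first character of a new run.
    static final Pattern RESTRICTED =
            Pattern.compile("(?=(a+b+|b+a+))(^|(?<=a)b|(?<=b)a)");

    // Hypothetical helper: collects group 1 of every match of the pattern.
    static List<String> groups(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String data = "aaaabbbaaaaab";
        System.out.println(groups(UNRESTRICTED, data).size() + " results without the restriction");
        System.out.println(groups(RESTRICTED, data).size() + " results with it");
    }
}
```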


Maybe the idea is easier to see with this regex:

(?=(a+b+|b+a+))((?<=^|a)b|(?<=^|b)a)

  • (?<=^|a)b will match a b that is at the start of the string or has an a before it
  • (?<=^|b)a will match an a that is at the start of the string or has a b before it
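
A quick way to check that this variant behaves like the first regex is to run both on the same input. This is only a sketch; `AnchorInLookbehindDemo` and `find` are names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorInLookbehindDemo {
    // Hypothetical helper: compiles the regex and collects group 1 of every match.
    static List<String> find(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String data = "aaaabbbaaaaab";
        // Variant with ^ as a separate alternative.
        List<String> first = find("(?=(a+b+|b+a+))(^|(?<=a)b|(?<=b)a)", data);
        // Variant with ^ moved inside each lookbehind.
        List<String> second = find("(?=(a+b+|b+a+))((?<=^|a)b|(?<=^|b)a)", data);
        System.out.println(first);
        System.out.println(second);
        System.out.println("same results: " + first.equals(second));
    }
}
```

The difference between the two is only in what gets consumed at the start of the input: the first variant matches an empty string there, while this one consumes the first character.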
Pshemo
  • Thank you very much :). It's a really clear answer. I had a look at RegexBuddy, but I'm still a bit confused because I'm really new to regex. Could you please explain a bit further what the ^ in the regex means? Thank you again for the great answer :). – dshgna Mar 31 '13 at 09:25
  • @dgun `^` is an [anchor](http://www.regular-expressions.info/anchors.html) that matches the beginning of the string. – Pshemo Mar 31 '13 at 09:26
  • Is ^ essentially needed in this regular expression? Why is that? (Sorry if that's stupid, I'm just curious:)) – dshgna Mar 31 '13 at 09:32
  • @dgun Check my edited answer. If there is anything you still don't understand, just ask. – Pshemo Mar 31 '13 at 09:54

You can simulate this with lookbehind:

((?<=a)b+|(?<=b)a+)

This outputs

bbb aaaaa b
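
A minimal sketch for trying this regex in Java, assuming the same input as in the question. The class `LookbehindRunsDemo` and the helper `runs` are illustrative names only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindRunsDemo {
    // Matches a run of b's preceded by an a, or a run of a's preceded by a b.
    private static final Pattern RUN_AFTER_BOUNDARY =
            Pattern.compile("((?<=a)b+|(?<=b)a+)");

    // Hypothetical helper: collects every matched run, in order.
    static List<String> runs(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = RUN_AFTER_BOUNDARY.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", runs("aaaabbbaaaaab")));
    }
}
```

Note that each result is the single run that follows a boundary, so the preceding run has to be joined back on to get the overlapping pairs from the question.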
nneonneo