How to regex match pairs within pairs

Question

My question is fairly straightforward, even if the purpose it will serve is pretty complicated. I will use a simple example:

AzzAyyAxxxxByyBzzB

So normally I would want to get everything between A and B. However, because some of the content between the first A and the last B (one pair) contains additional AB pairs I need to push back the end of the match. (Not sure if that last part made sense).

So what I'm looking for is some RegEx that would allow me to have the following output:

Match 1
  Group 1: AzzAyyAxxxxByyBzzB
  Group 2: zzAyyAxxxxByyBzz

Then I would match it again to get:

Match 2
  Group 1: AyyAxxxxByyB
  Group 2: yyAxxxxByy

Then finally again to get:

Match 3
  Group 1: AxxxxB
  Group 2: xxxx

Obviously if I try (A(.*?)B) on the whole input I get:

Match x
  Group 1: AzzAyyAxxxxB
  Group 2: zzAyyAxxxx

Which is not what I'm looking for :)

I hope this makes sense. I understand if this can't be done in RegEx, but I thought I would ask some of you regex wizards before I give up on it and try something else. Thanks!

Additional Info:

The project I'm working on is written in Java.

One other problem is that I'm parsing a document which could contain something like this:

AzzAyyAxxxxByyBzzB
Here is some unrelated stuff
AzzAyyAxxxxByyBzzB
AzzzBxxArrrBAssssB

And the top AB pairs needs to be separate from the bottom AB pairs

Martin Ender · Accepted Answer · 2012-11-06T23:31:43.777

2

You made your regex explicitly ungreedy by using the ?. Just leave it out and the regex will consume as much as possible before matching the B:

(A(.*)B)

However, in general nested structures are beyond the scope of regular expressions. In a case like this:

AxxxByyyAzzzB

You would now also match from the first A to the last B. If this is possible in your scenario, you might be better of going through the string yourself character-by-character and counting As and Bs to figure out which ones belong together.

EDIT:

Now that you have updated the question and we figured this out in the comments, you do have the problem of multiple consecutive pairs. In this case, this cannot be done with a regex engine that does not support recursion.

However you can switch to matching from the inside out.

A([^AB]*)B

This will only get innermost pairs, because there can be neither an A nor a B between the delimiters. If you find it, you can then remove the pair and continue with your next match.

edited Nov 06 '12 at 23:31

answered Nov 06 '12 at 23:13

Martin Ender

43,427
11
90
130

Perfect solution, unfortunately I should have mentioned another constraint. Please see the edit. Sorry! – kentcdodds Nov 06 '12 at 23:15
@kentcdodds actually I was already editing that into my answer ;). do those "parallel" pairs of `A...B` always only occur on different lines? Or might there be multiple consecutive pairs on one line? – Martin Ender Nov 06 '12 at 23:17
@kentcdodds which language or tool do you use? some regex engines support simple recursion which might help you out there. – Martin Ender Nov 06 '12 at 23:18
@kentcdodds nope, Java doesn't support it as far as I can see. how about my other question? – Martin Ender Nov 06 '12 at 23:22
@kentcdodds then this will not be possible with regex. however, you could match from the inside to the outside with `A([^AB]*)B`. This will only get innermost pairs. You could then remove them and find the next match. – Martin Ender Nov 06 '12 at 23:27
That is a perfect answer! I hadn't thought of that! Thanks! – kentcdodds Nov 06 '12 at 23:28

score 0 · Answer 2 · answered Nov 06 '12 at 23:18

0

Use word boundary if you use multiline mode:

\bA(.*)B\b  #for matches that does not start from beginning of line to end

or

^A(.*)B$    #for matches that start from beginning of line till end

answered Nov 06 '12 at 23:18

SwiftMango

15,092
13
71
136

score 0 · Answer 3 · answered Nov 06 '12 at 23:31

0

You won't be able to do this with Regular Expressions alone. What you're describing is more Context-Free than Regular. In order to parse something like this you need to push a new context onto a stack every time to encounter an 'A' and pop the stack every time you encounter a 'B'. You need something more like a pushdown automaton than a regular expression.

answered Nov 06 '12 at 23:31

Richard JP Le Guen

28,364
7
89
119

There are a few regex implementations out there which solve the given problem (e.g. PCRE). And generally almost all of the common regex engines push the meaning of "regular expression" quite a bit, by allowing back-references and lookarounds. Of course, in a theoretical sense I completely agree with you and even though there are a few regex engines which can handle this, it will certainly never be an elegant solution. But still, I though it's worth pointing out... – Martin Ender Nov 06 '12 at 23:34
@m.buettner - I'm curious: how are backreferences not regular? They seem like an implementation detail pertaining to matches and capture groups, but I don't think they affect the "regularness" of the language. – Richard JP Le Guen Nov 06 '12 at 23:41
@m.buettner - Sorry; backreferences are clearly not regular. I meant to say "how are lookarounds not regular" – Richard JP Le Guen Nov 06 '12 at 23:46
1

Oh, I guess [you're right](http://stackoverflow.com/questions/2974210/does-lookaround-affect-which-languages-can-be-matched-by-regular-expressions). – Martin Ender Nov 06 '12 at 23:55

How to regex match pairs within pairs

3 Answers3

Linked