3

I want to match the last group, which is enclosed in [], but may contain one of more [] inside of itself in a nested structure.

I managed, although not elegantly, to get the nested [] matching going using the regex of python. This solution works for some cases (such as s1) but not s2 or s3 when there are multiple such matches. My solution will only match the first one.

Any suggestions? A better regular expression? Or regular expression is not the way to go? Thanks a lot!

In [116]:

s1 = 'AAA [BBB [CCC]]'
s2 = 'AAA [DDD] [EEE]'
s3 = 'AAA [BBB [CCC]] [EEE]'

for s in [s1, s2, s3]:
    result = regex.search(r'(?<rec>\[(?:[^\[\]]++|(?&rec))*\])',s,flags=regex.VERBOSE)
    print(result.captures('rec'))
['[CCC]', '[BBB [CCC]]'] #I know it is perfect, but I can take the last one in the list
['[DDD]'] #This is the first one, I want the last one, which is [EEE]
['[CCC]', '[BBB [CCC]]'] #same problem as above

Edit:

Thanks a lot of the help, if I have 15 reps I will up-vote ya all. However, sorry for not including the intended result, which should be:

'AAA [BBB [CCC]]' -> '[BBB [CCC]]'
'AAA [DDD] [EEE]' -> '[EEE]'
'AAA [BBB [CCC]] [EEE]' -> '[EEE]'
'000 [[aaa] xxx [yyy [zzz ]]' -> '[[aaa] xxx [yyy [zzz ]]'
zx81
  • 41,100
  • 9
  • 89
  • 105
  • You can in fact accomplish this with look aheads/behinds however, I would likely make use of a stack/parser in this type of situation. – SystemFun Jun 11 '14 at 22:41
  • 1
    @Vlad, he has a nested structure. That may be why he has recursive regex here, I am not not sure that look around alone will do, but correct me if I am wrong. Welcome to SO, OP! – CT Zhu Jun 11 '14 at 22:44
  • what output do you want? – Padraic Cunningham Jun 11 '14 at 22:51
  • Is `[[aaa] xxx [yyy [zzz ]]` a possible situation? i.e. A mixture of different level of nested structure. – CT Zhu Jun 11 '14 at 22:54
  • FYI added tested Python code. :) – zx81 Jun 11 '14 at 23:09
  • @CTZhu, a good point I haven't though about, but `000 [[aaa] xxx [yyy [zzz ]]` should yield `[[aaa] xxx [yyy [zzz ]]`, see edit: – user3732025 Jun 11 '14 at 23:12
  • Glad it works! FYI also added regex for more complex nests like `Added regex for more complex nests like `[B[C] [D] [E[F][G]]]` :) – zx81 Jun 11 '14 at 23:28

3 Answers3

3

In Python, to use recursion or repeated subroutines, we need to use Matthew Barnett's outstanding regex module... And, as @CTZhu points out, you are already using it!

To be clear on terms, there can be several understandings of "nesting", such as:

  1. Simple nesting as in [C[D[E]F]], which is a subset of...
  2. More complex, family-style nesting as in [B[C] [D] [E[F][G]]].

You need to be able to handle the latter, and this short regex does it for us:

\[(?:[^[\]]++|(?R))*\]

This will match all the nested braces. Now all we need to do is print the last match.

Here is some tested Python code:

import regex # say "yeah!" for Matthew Barnett
pattern = r'\[(?:[^[\]]++|(?R))*\]'
myregex = regex.compile(pattern)

# this outputs [EEE]
matches = myregex.findall('AAA [BBB [CCC]] [EEE]')
print (matches[-1])

# this outputs [C[D[E]F]] (simple nesting)
matches = myregex.findall('AAA [BBB] [C[D[E]F]]')
print (matches[-1])

# this outputs [B[C] [D] [E[F][G]]] (family-style nesting)
matches = myregex.findall('AAA [AAA] [B[]B[B]] [B[C] [D] [E[F][G]]]')
print (matches[-1])
zx81
  • 41,100
  • 9
  • 89
  • 105
  • +1 for the (that I think) right answer. But you missed the fact that OP is already using `regex`, see `regex.search`? Cheers! – CT Zhu Jun 11 '14 at 23:20
  • Thanks for the prefect solution! I will study the demo you provided to understand it better. – user3732025 Jun 11 '14 at 23:27
  • @CTZhu Thanks, hadn't noticed that! Added regex for more complex nests like `[B[C] [D] [E[F][G]]]` :) – zx81 Jun 11 '14 at 23:28
  • +1, Wonderful! Thanks a lot for your help @zx81. Especially for the B solution, which I have not even considered before! And I am reading your answer that has a score of 88, very impressive! – user3732025 Jun 11 '14 at 23:33
  • @user3732025 Thanks for the great question, hope to see you again! :) – zx81 Jun 11 '14 at 23:36
  • Great solution! You do not need, however, to use `matches[len(matches)-1]` Just do `matches[-1]` for the last element, `matches[-2]` for second to last, etc. – dawg Jun 12 '14 at 00:21
  • @dawg Indeed! As you can see I'm still learning Python. Thanks very much, made the edit, that's an improvement. :) – zx81 Jun 12 '14 at 00:43
  • @user3732025 FYI improved the regex, shorter and better. :) – zx81 Jun 12 '14 at 03:32
  • This regex with the final example of `'000 [[aaa] xxx [yyy [zzz ]]'` produces `[yyy [zzz ]]` which is incomplete and unbalanced braces. – dawg Jun 12 '14 at 14:47
  • @dawg Is that a new example that wasn't there yesterday? I'm adding digits for Opening and Closing so you can see the problem with what you want. -> 'O1[O2[aaaC1] xxx O3[yyy O4[zzz C2]C3]' That's 4 opening braces and 3 closing ones. In contrast, look at the one you accuse of being incomplete and unbalanced: `O1[yyy O2[zzz C1]C2]` The pattern simply matches the last balanced brace in the whole string (the first one is `[aaa]`. We've assumed that your whole strings are balanced. If that's not the case, it's another problem (which can be solved with another similar regex). :) – zx81 Jun 12 '14 at 19:36
2

You can use this recursive regex, and just print the last match:

s1 = 'AAA [BBB [CCC]]'
s2 = 'AAA [DDD] [EEE]'
s3 = 'AAA [BBB [CCC]] [EEE]'

import regex

for e in (s1, s2, s3):
    matches=regex.findall(r'[^\[\]\s]+ | \[ (?: (?R) | [^\[\]]+ )+\]', e, regex.VERBOSE)
    print(e, '=>', matches, '=>', matches[-1])

Prints:

AAA [BBB [CCC]] => ['AAA', '[BBB [CCC]]'] => [BBB [CCC]]
AAA [DDD] [EEE] => ['AAA', '[DDD]', '[EEE]'] => [EEE]
AAA [BBB [CCC]] [EEE] => ['AAA', '[BBB [CCC]]', '[EEE]'] => [EEE]
dawg
  • 98,345
  • 23
  • 131
  • 206
  • Thanks a lot . Actually it is a little different from the intended result and @zx81 has already provided a good solution. – user3732025 Jun 11 '14 at 23:36
1

Going off your given data and you stating you want the last group, I'll provide you with this recursive regex.

import regex

s1 = 'AAA [BBB [CCC]]'
s2 = 'AAA [DDD] [EEE]'
s3 = 'AAA [BBB [CCC]] [EEE]'

for s in [s1, s2, s3]:
    result = regex.findall(r'\[(?:[^[\]]|(?R))*\]', s)
    print result[-1]

Output

[BBB [CCC]]
[EEE]
[EEE]
hwnd
  • 69,796
  • 4
  • 95
  • 132