Regex to extract nested patterns

Question

Possible Duplicate:
Matching Nested Structures With Regular Expressions in Python

I can't wrap my head around this problem. I have a string like the following one:

Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet

My task would be to extract the commands (they are always starting with [@ and ending with ]) and their subcommands. A result like

[
    [@a xxx yyy [@b xxx yyy [@c xxx yyy]]], # the most outer
    [@b xxx yyy [@c xxx yyy]],              # the middle one
    [@c xxx yyy]                            # the inner most
]

would be highly appreciated. The problem is that these kind of commands can occur in very long text messages, so a "performant" solution would be nice.

I was toying around with some regex patterns mostly of the time something like

(\[@.*?\]\s) # for the outer one

but i have seen no light in matching the middle and inner one. To make it more complicated, the amount of nested commands is variable... Might some special regex be the solution? I have read about lookaheads and lookbehinds but no idea how to use them in this special case.

Thank a bunch!

UPDATE

@Cyborgx37 pointed me to another post that uses the pyparsing package. It would be nice to have a solution without an external package or library. But pyparsing definately solves that problem!

I doubt that `regex` will be very good with this. My impression is that it doesn't do well with nested structures ... — mgilson, Feb 05 '13 at 16:22
The pyparsing package seems ideal for this... Just waiting if someone else has maybe another good idea without using an external package. — hetsch, Feb 05 '13 at 16:34
@hetsch - If you have researched pyparsing and there's a particular feature lacking (or you don't want to use an external package), then you should consider updating your question saying so. Otherwise, it will likely be closed. — JDB, Feb 05 '13 at 16:41
@hetsch: Since this question is closed, I added an alternative solution to this problem which does not require any third-party modules [here](http://stackoverflow.com/a/14715850/190597). — unutbu, Feb 05 '13 at 20:08

score 2 · Accepted Answer · answered Feb 05 '13 at 16:45

C# has recursive/nested RegEx, I don't believe Python does. You could re-run the RegEx search on previous results, but this is probably less efficient (the overhead of RegEx for such a simple search) than just making a custom parser. The text your searching for "[@" and "]" isn't very complex.

Here's a custom parser (in JavaScript) that would do the job.

var txt = "Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet";
function parse(s) {
    var stack = [];
    var result = [];
    for(var x=0; x<s.length; x++) {
        var c = s.charAt(x);
        if(c == '[' && x+1 < s.length-1 && s.charAt(x+1) == '@') {
            for(var y=0; y<stack.length; y++)
                stack[y] += "[@";
            stack.push("[@");
            x++;
        } else if(c == ']' && stack.length > 0) {
            for(var y=0; y<stack.length; y++)
                stack[y] += "]";
            result.push(stack.pop());
        } else {
            for(var y=0; y<stack.length; y++)
                stack[y] += c;
        }
    }
    return result;
}
parse(txt);

It quickly loops through all the characters of the text (only once) and uses a stack and an if...if else...else condition to push, pop and modify the values in that stack respectively.

+1 Wow... didn't expect anyone to actually write the parser for him/her. — JDB, Feb 05 '13 at 16:50

score 0 · Answer 2 · answered Feb 05 '13 at 16:25

0

So coming from a c# background, I'm not sure this is going to help but, I imagine that since you have to parse the inside commands anyway, why not just store the contents of the command, and then run your regex function again on the inner data? I know I'm probably missing something, but that's why I would try at least.

answered Feb 05 '13 at 16:25

sircodesalot

11,231
8
50
83

1

No... not that simple. You must make sure that you don't stop at the first `]`, because you need to consume the inner structure. But you can't just match until the last `]` because there may be multiple hierarchies back to back. For example: `[@outer1 [@inner1]] [@outer2[@inner2]]`. You must balance opening `[` with closing `]`. – JDB Feb 05 '13 at 16:28
My problem was that I couldn't identify the closing ] of the outer command. But you brought me to an idea that the most outer one is always identified by a preceded ] or any alphanum char... – hetsch Feb 05 '13 at 16:30
That's true @Cyborgx37 – hetsch Feb 05 '13 at 16:31
Ah, good point. Thanks – sircodesalot Feb 05 '13 at 16:32

score 0 · Answer 3 · answered Feb 05 '13 at 17:03

0

No wonder you cannot wrap your head around the problem. There is a formal language theory regarding formal languages. Noam Chomsky described four categories of the languages -- known as Chomsky hierarchy. Regular expressions are capable do describe the easies category of the languages -- the regular languages. However, languages with nested paired structures are outside of regular languages, and they cannot be described/accepted by regular expressions.

One kind of the parsers that are the most easily implemented are the ones based on recursive call of functions that parse elements of the language.

answered Feb 05 '13 at 17:03

pepr

20,112
15
76
139

1

It's common to confuse formal regular expressions with the tools that were inspired by them. In reality, modern regexes implement features that ["allows them to parse far more than the strictly regular languages for which they were originally named. Today, we still call these pattern-matching languages regular expressions (or regexes for short), even if they may no longer be regular in the computer scientific sense."](http://stackoverflow.com/tags/regex/info) – JDB Feb 05 '13 at 20:56
FYI: http://stackoverflow.com/q/11306641/989121 – georg Feb 05 '13 at 20:59
@Cyborgx37: Well, but then the regular expression engine is given a bit different name -- to emphasise it does something more than usual regular expression. Anyway, I did not check heavily, but Python `re` module does not support the recursive extension. – pepr Feb 06 '13 at 10:23

Regex to extract nested patterns

3 Answers3

Linked