Perl regular expression: match nested brackets

Question

I'm trying to match nested {} brackets with a regular expressions in Perl so that I can extract certain pieces of text from a file. This is what I have currently:

my @matches = $str =~ /\{(?:\{.*\}|[^\{])*\}|\w+/sg;

foreach (@matches) {
    print "$_\n";
}

At certain times this works as expected. For instance, if $str = "abc {{xyz} abc} {xyz}" I obtain:

abc
{{xyz} abc}
{xyz}

as expected. But for other input strings it does not function as expected. For example, if $str = "{abc} {{xyz}} abc", the output is:

{abc} {{xyz}}
abc

which is not what I expected. I would have wanted {abc} and {{xyz}} to be on separate lines, since each is balanced on its own in terms of brackets. Is there an issue with my regular expression? If so, how would I go about fixing it?

I believe [lookahead](http://www.regular-expressions.info/lookaround.html) could help — i--, Mar 08 '13 at 19:33

ikegami · Accepted Answer · 2013-11-06T20:39:19.730

19

You were surprised how your pattern matched, but noone explained it? Here's how your pattern is matching:

my @matches = $str =~ /\{(?:\{.*\}|[^{])*\}|\w+/sg;
                       ^    ^ ^ ^  ^      ^
                       |    | | |  |      |
{ ---------------------+    | | |  |      |
a --------------------------)-)-)--+      |
b --------------------------)-)-)--+      |
c --------------------------)-)-)--+      |
} --------------------------)-)-)--+      |
  --------------------------)-)-)--+      |
{ --------------------------+ | |         |
{ ----------------------------+ |         |
x ----------------------------+ |         |
y ----------------------------+ |         |
z ----------------------------+ |         |
} ------------------------------+         |
} ----------------------------------------+

As you can see, the problem is that /\{.*\}/ matches too much. What should be in there is a something that matches

(?: \s* (?: \{ ... \} | \w+ ) )*

where the ... is

(?: \s* (?: \{ ... \} | \w+ ) )*

So you need some recursion. Named groups are an easy way of doing this.

say $1
   while /
      \G \s*+ ( (?&WORD) | (?&BRACKETED) )

      (?(DEFINE)
         (?<WORD>      \s* \w+ )
         (?<BRACKETED> \s* \{ (?&TEXT)? \s* \} )
         (?<TEXT>      (?: (?&WORD) | (?&BRACKETED) )+ )
      )
   /xg;

But instead of reinventing the wheel, why not use Text::Balanced.

edited Nov 06 '13 at 20:39

answered Mar 08 '13 at 20:57

ikegami

367,544
15
269
518

1

wow how did you make that first ascii image? did you use a tool? – Alex Gordon Mar 11 '13 at 20:16
5

@Артём Царионов, I did it by hand. – ikegami Mar 11 '13 at 20:27
1

@Артём Царионов, nah, that was easy. It took far more time to write the code at the bottom. You might like [YAPE::Regex::Explain](http://search.cpan.org/perldoc?YAPE::Regex::Explain) – ikegami Mar 11 '13 at 20:29
@Артём Царионов, Actually, there was a line missing in the diagram. Fixed. – ikegami Mar 11 '13 at 20:33
i figured that may have been the issue! – Alex Gordon Mar 11 '13 at 20:47
@ikegami Can you explain the last regex please? Or tell me how to use it. Thx! – EverythingRightPlace Dec 09 '13 at 12:51
It matches strings like `abc {{xyz} abc} {xyz}` like the OP asked. How it does so it quite readable. // I didn't just give the regex; I included code that uses it. – ikegami Dec 09 '13 at 13:03
@ikegami But where is the string read? I just know the `$str =~ //` style. – EverythingRightPlace Dec 09 '13 at 13:20
2

@bashophil, It defaults to `$_`. Feel free to match against `$str` if you wish. – ikegami Dec 09 '13 at 14:30
@ikegami Thanks for this. You could base a book on this type of explanation :-). – Gev Oct 29 '21 at 06:24

Schwern · Answer 2 · 2013-03-08T20:10:23.447

The problem of matching balanced and nested delimiters is covered in perlfaq5 and I'll leave it to them to cover all the options including (?PARNO) and Regexp::Common.

But matching balanced items is tricky and prone to error, unless you really want to learn and maintain advanced regexes, leave it to a module. Fortunately there is Text::Balanced to handle this and so very much more. It is the Swiss Army Chainsaw of balanced text matching.

Unfortunately it does not handle escaping on bracketed delimiters.

use v5.10;
use strict;
use warnings;

use Text::Balanced qw(extract_multiple extract_bracketed);

my @strings = ("abc {{xyz} abc} {xyz}", "{abc} {{xyz}} abc");

for my $string (@strings) {
    say "Extracting from $string";

    # Extract all the fields, rather than one at a time.
    my @fields = extract_multiple(
        $string,
        [
            # Extract {...}
            sub { extract_bracketed($_[0], '{}') },
            # Also extract any other non whitespace
            qr/\S+/
        ],
        # Return all the fields
        undef,
        # Throw out anything which does not match
        1
    );

    say join "\n", @fields;
    print "\n";
}

You can think of extract_multiple like a more generic and powerful split.

MikeM · Answer 3 · 2013-03-11T20:02:52.480

6

To match nested brackets with just one pair at each level of nesting,
but any number of levels, e.g. {1{2{3}}}, you could use

/\{[^}]*[^{]*\}|\w+/g

To match when there may be multiple pairs at any level of nesting, e.g. {1{2}{2}{2}}, you could use

/(?>\{(?:[^{}]*|(?R))*\})|\w+/g

The (?R) is used to match the whole pattern recursively.

To match the text contained within a pair of brackets the engine must match (?:[^{}]*|(?R))*,
i.e. either [^{}]* or (?R), zero or more times *.

So in e.g. "{abc {def}}", after the opening "{" is matched, the [^{}]* will match the "abc " and the (?R) will match the "{def}", then the closing "}" will be matched.

The "{def}" is matched because (?R) is simply short for the whole pattern
(?>\{(?:[^{}]*|(?R))*\})|\w+, which as we have just seen will match a "{" followed by text matching [^{}]*, followed by "}".

Atomic grouping (?>...) is used to prevent the regex engine backtracking into bracketed text once it has been matched. This is important to ensure the regex will fail fast if it cannot find a match.

edited Mar 11 '13 at 20:02

answered Mar 08 '13 at 19:39

MikeM

13,156
2
34
47

Recursion is to take care of arbitrary number of nesting levels. It is not necessary if the maximum number of nesting levels is known, but must be used otherwise. – nhahtdh Mar 08 '13 at 19:58
This doesn't seem to work for an input of `"{{} {}}"` (which it should match all of). – arshajii Mar 08 '13 at 20:11
@A.R.S. Added recursive version for such input. – MikeM Mar 08 '13 at 20:20
@nhahtdh. The question does not specify multiple pairs at the same level of nesting, and my first regex works fine for any number of nested levels, but not multiple pairs at the _same level_ of nesting. That was why I said that recursion was unnecessary in this case. – MikeM Mar 08 '13 at 20:39
Just a small case: it fails to match `{{} {}}`, since you use `+` for `[^{}]`. – nhahtdh Mar 08 '13 at 20:40
@MikeM: OP does not specify in the question, but the OP commented on your answer for such case. It is rare to only consider the case of pyramid nesting. Bracket matching usually have to consider multiple `{}` on the same level. (By the way, I never voted (up or down) on your answer). – nhahtdh Mar 08 '13 at 20:41
@nhahtdh Yes, I was referring to my earlier comment before he had done that. And good point, so I have changed `+` to `*`. – MikeM Mar 08 '13 at 20:45

Borodin · Answer 4 · 2013-03-08T20:47:00.553

5

You need a recursive regex. This should work:

my @matches;
push @matches, $1 while $str =~ /( [^{}\s]+ | ( \{ (?: [^{}]+ | (?2) )* \} ) )/xg;

or, if you prefer the non-loop version:

my @matches = $str =~ /[^{}\s]+ | \{ (?: (?R) | [^{}]+ )+ \} /gx;

edited Mar 08 '13 at 20:47

answered Mar 08 '13 at 19:39

Borodin

126,100
9
70
144

Thanks for your answer. I also had a `\w+` that I wanted to match, so can I just append `|\w+` on to the end? – arshajii Mar 08 '13 at 19:49
I've fixed it to allow for that – Borodin Mar 08 '13 at 19:56
This seems to be duplicating all of the matches, strangely. – arshajii Mar 08 '13 at 20:22
@A.R.S.: You will need a loop like in my answer. – nhahtdh Mar 08 '13 at 20:25
@nhahtdh. No loop required - see my answer. – MikeM Mar 08 '13 at 20:28
1

@A.R.S.: The problem is that a regex in list context will return the contents of all the captures. It is only a concise way of writing a `while` loop. I have put my regex in context so that you can see how it would work – Borodin Mar 08 '13 at 20:34
@A.R.S.: I've added a version that doesn't use captures in case you prefer it without a loop. – Borodin Mar 08 '13 at 20:47
There is nothing duplicated. Do you mean the two character classes? They are different. – Borodin Mar 08 '13 at 21:04

score 4 · Answer 5 · answered Mar 15 '13 at 13:15

Wow. What a bunch of complicated answers to something that simple.

The problem you're having is that you're matching in greedy mode. That is, you are aking the regex engine to match as much as possible while making the expression true.

To avoid greedy match, just add a '?' after your quantifier. That makes the match as short as possible.

So, I changed your expression from:

my @matches = $str =~ /\{(?:\{.*\}|[^\{])*\}|\w+/sg;

To:

my @matches = $str =~ /\{(?:\{.*?\}|[^\{])*?\}|\w+/sg;

...and now it works exactly as you're expecting.

HTH

Francisco

score 2 · Answer 6 · answered Mar 08 '13 at 20:00

One way using the built-in module Text::Balanced.

Content of script.pl:

#!/usr/bin/env perl

use warnings;
use strict;
use Text::Balanced qw<extract_bracketed>;

while ( <DATA> ) { 

    ## Remove '\n' from input string.
    chomp;

    printf qq|%s\n|, $_; 
    print "=" x 20, "\n";


    ## Extract all characters just before first curly bracket.
    my @str_parts = extract_bracketed( $_, '{}', '[^{}]*' );

    if ( $str_parts[2] ) { 
        printf qq|%s\n|, $str_parts[2];
    }   

    my $str_without_prefix = "@str_parts[0,1]";


    ## Extract data of balanced curly brackets, remove leading and trailing
    ## spaces and print.
    while ( my $match = extract_bracketed( $str_without_prefix, '{}' ) ) { 
        $match =~ s/^\s+//;
        $match =~ s/\s+$//;
        printf qq|%s\n|, $match;

    }   

    print "\n";
}

__DATA__
abc {{xyz} abc} {xyz}
{abc} {{xyz}} abc

Run it like:

perl script.pl

That yields:

abc {{xyz} abc} {xyz}
====================
abc 
{{xyz} abc}
{xyz}

{abc} {{xyz}} abc
====================
{abc}
{{xyz}}

I was just about to start an answer like this! See an example in the wild here, a tiny TeX parser: https://github.com/jberger/MakeBeamerInfo/blob/master/lib/App/makebeamerinfo.pm#L230 — Joel Berger, Mar 08 '13 at 20:01

score 1 · Answer 7 · edited May 23 '17 at 12:32

Just modifies and extends the classic solution a bit:

(\{(?:(?1)|[^{}]*+)++\})|[^{}\s]++

Demo (This is in PCRE. The behavior is slightly different from Perl when it comes to recursive regex, but I think it should produce the same result for this case).

After some struggle (I am not familiar with Perl!), this is the demo on ideone. $& refers to the string matched by the whole regex.

my $str = "abc {{xyz} abc} {xyz} {abc} {{xyz}} abc";

while ($str =~ /(\{(?:(?1)|[^{}]*+)++\})|[^{}\s]++/g) {
    print "$&\n"
}

Note that this solution assumes that the input is valid. It will behave rather randomly on invalid input. It can be modified slightly to halt when invalid input is encountered. For that, I need more details on the input format (preferably as a grammar), such as whether abc{xyz}asd is considered valid input or not.

@Borodin: Some optimization to prevent backtracking. It assumes the string is valid, though. — nhahtdh, Mar 08 '13 at 19:54

Perl regular expression: match nested brackets

7 Answers7

Linked

Related