11

I'm trying to match nested {} brackets with a regular expressions in Perl so that I can extract certain pieces of text from a file. This is what I have currently:

my @matches = $str =~ /\{(?:\{.*\}|[^\{])*\}|\w+/sg;

foreach (@matches) {
    print "$_\n";
}

At certain times this works as expected. For instance, if $str = "abc {{xyz} abc} {xyz}" I obtain:

abc
{{xyz} abc}
{xyz}

as expected. But for other input strings it does not function as expected. For example, if $str = "{abc} {{xyz}} abc", the output is:

{abc} {{xyz}}
abc

which is not what I expected. I would have wanted {abc} and {{xyz}} to be on separate lines, since each is balanced on its own in terms of brackets. Is there an issue with my regular expression? If so, how would I go about fixing it?

arshajii
  • 127,459
  • 24
  • 238
  • 287
  • I believe [lookahead](http://www.regular-expressions.info/lookaround.html) could help – i-- Mar 08 '13 at 19:33

7 Answers7

19

You were surprised how your pattern matched, but noone explained it? Here's how your pattern is matching:

my @matches = $str =~ /\{(?:\{.*\}|[^{])*\}|\w+/sg;
                       ^    ^ ^ ^  ^      ^
                       |    | | |  |      |
{ ---------------------+    | | |  |      |
a --------------------------)-)-)--+      |
b --------------------------)-)-)--+      |
c --------------------------)-)-)--+      |
} --------------------------)-)-)--+      |
  --------------------------)-)-)--+      |
{ --------------------------+ | |         |
{ ----------------------------+ |         |
x ----------------------------+ |         |
y ----------------------------+ |         |
z ----------------------------+ |         |
} ------------------------------+         |
} ----------------------------------------+

As you can see, the problem is that /\{.*\}/ matches too much. What should be in there is a something that matches

(?: \s* (?: \{ ... \} | \w+ ) )*

where the ... is

(?: \s* (?: \{ ... \} | \w+ ) )*

So you need some recursion. Named groups are an easy way of doing this.

say $1
   while /
      \G \s*+ ( (?&WORD) | (?&BRACKETED) )

      (?(DEFINE)
         (?<WORD>      \s* \w+ )
         (?<BRACKETED> \s* \{ (?&TEXT)? \s* \} )
         (?<TEXT>      (?: (?&WORD) | (?&BRACKETED) )+ )
      )
   /xg;

But instead of reinventing the wheel, why not use Text::Balanced.

ikegami
  • 367,544
  • 15
  • 269
  • 518
14

The problem of matching balanced and nested delimiters is covered in perlfaq5 and I'll leave it to them to cover all the options including (?PARNO) and Regexp::Common.

But matching balanced items is tricky and prone to error, unless you really want to learn and maintain advanced regexes, leave it to a module. Fortunately there is Text::Balanced to handle this and so very much more. It is the Swiss Army Chainsaw of balanced text matching.

Unfortunately it does not handle escaping on bracketed delimiters.

use v5.10;
use strict;
use warnings;

use Text::Balanced qw(extract_multiple extract_bracketed);

my @strings = ("abc {{xyz} abc} {xyz}", "{abc} {{xyz}} abc");

for my $string (@strings) {
    say "Extracting from $string";

    # Extract all the fields, rather than one at a time.
    my @fields = extract_multiple(
        $string,
        [
            # Extract {...}
            sub { extract_bracketed($_[0], '{}') },
            # Also extract any other non whitespace
            qr/\S+/
        ],
        # Return all the fields
        undef,
        # Throw out anything which does not match
        1
    );

    say join "\n", @fields;
    print "\n";
}

You can think of extract_multiple like a more generic and powerful split.

Schwern
  • 153,029
  • 25
  • 195
  • 336
6

To match nested brackets with just one pair at each level of nesting,
but any number of levels, e.g. {1{2{3}}}, you could use

/\{[^}]*[^{]*\}|\w+/g

To match when there may be multiple pairs at any level of nesting, e.g. {1{2}{2}{2}}, you could use

/(?>\{(?:[^{}]*|(?R))*\})|\w+/g

The (?R) is used to match the whole pattern recursively.

To match the text contained within a pair of brackets the engine must match (?:[^{}]*|(?R))*,
i.e. either [^{}]* or (?R), zero or more times *.

So in e.g. "{abc {def}}", after the opening "{" is matched, the [^{}]* will match the "abc " and the (?R) will match the "{def}", then the closing "}" will be matched.

The "{def}" is matched because (?R) is simply short for the whole pattern
(?>\{(?:[^{}]*|(?R))*\})|\w+, which as we have just seen will match a "{" followed by text matching [^{}]*, followed by "}".

Atomic grouping (?>...) is used to prevent the regex engine backtracking into bracketed text once it has been matched. This is important to ensure the regex will fail fast if it cannot find a match.

MikeM
  • 13,156
  • 2
  • 34
  • 47
  • Recursion is to take care of arbitrary number of nesting levels. It is not necessary if the maximum number of nesting levels is known, but must be used otherwise. – nhahtdh Mar 08 '13 at 19:58
  • This doesn't seem to work for an input of `"{{} {}}"` (which it should match all of). – arshajii Mar 08 '13 at 20:11
  • @A.R.S. Added recursive version for such input. – MikeM Mar 08 '13 at 20:20
  • @nhahtdh. The question does not specify multiple pairs at the same level of nesting, and my first regex works fine for any number of nested levels, but not multiple pairs at the _same level_ of nesting. That was why I said that recursion was unnecessary in this case. – MikeM Mar 08 '13 at 20:39
  • Just a small case: it fails to match `{{} {}}`, since you use `+` for `[^{}]`. – nhahtdh Mar 08 '13 at 20:40
  • @MikeM: OP does not specify in the question, but the OP commented on your answer for such case. It is rare to only consider the case of pyramid nesting. Bracket matching usually have to consider multiple `{}` on the same level. (By the way, I never voted (up or down) on your answer). – nhahtdh Mar 08 '13 at 20:41
  • @nhahtdh Yes, I was referring to my earlier comment before he had done that. And good point, so I have changed `+` to `*`. – MikeM Mar 08 '13 at 20:45
5

You need a recursive regex. This should work:

my @matches;
push @matches, $1 while $str =~ /( [^{}\s]+ | ( \{ (?: [^{}]+ | (?2) )* \} ) )/xg;

or, if you prefer the non-loop version:

my @matches = $str =~ /[^{}\s]+ | \{ (?: (?R) | [^{}]+ )+ \} /gx;
Borodin
  • 126,100
  • 9
  • 70
  • 144
4

Wow. What a bunch of complicated answers to something that simple.

The problem you're having is that you're matching in greedy mode. That is, you are aking the regex engine to match as much as possible while making the expression true.

To avoid greedy match, just add a '?' after your quantifier. That makes the match as short as possible.

So, I changed your expression from:

my @matches = $str =~ /\{(?:\{.*\}|[^\{])*\}|\w+/sg;

To:

my @matches = $str =~ /\{(?:\{.*?\}|[^\{])*?\}|\w+/sg;

...and now it works exactly as you're expecting.

HTH

Francisco

Francisco Zarabozo
  • 3,676
  • 2
  • 28
  • 54
2

One way using the built-in module Text::Balanced.

Content of script.pl:

#!/usr/bin/env perl

use warnings;
use strict;
use Text::Balanced qw<extract_bracketed>;

while ( <DATA> ) { 

    ## Remove '\n' from input string.
    chomp;

    printf qq|%s\n|, $_; 
    print "=" x 20, "\n";


    ## Extract all characters just before first curly bracket.
    my @str_parts = extract_bracketed( $_, '{}', '[^{}]*' );

    if ( $str_parts[2] ) { 
        printf qq|%s\n|, $str_parts[2];
    }   

    my $str_without_prefix = "@str_parts[0,1]";


    ## Extract data of balanced curly brackets, remove leading and trailing
    ## spaces and print.
    while ( my $match = extract_bracketed( $str_without_prefix, '{}' ) ) { 
        $match =~ s/^\s+//;
        $match =~ s/\s+$//;
        printf qq|%s\n|, $match;

    }   

    print "\n";
}

__DATA__
abc {{xyz} abc} {xyz}
{abc} {{xyz}} abc

Run it like:

perl script.pl

That yields:

abc {{xyz} abc} {xyz}
====================
abc 
{{xyz} abc}
{xyz}

{abc} {{xyz}} abc
====================
{abc}
{{xyz}}
Birei
  • 35,723
  • 2
  • 77
  • 82
  • 1
    I was just about to start an answer like this! See an example in the wild here, a tiny TeX parser: https://github.com/jberger/MakeBeamerInfo/blob/master/lib/App/makebeamerinfo.pm#L230 – Joel Berger Mar 08 '13 at 20:01
1

Just modifies and extends the classic solution a bit:

(\{(?:(?1)|[^{}]*+)++\})|[^{}\s]++

Demo (This is in PCRE. The behavior is slightly different from Perl when it comes to recursive regex, but I think it should produce the same result for this case).

After some struggle (I am not familiar with Perl!), this is the demo on ideone. $& refers to the string matched by the whole regex.

my $str = "abc {{xyz} abc} {xyz} {abc} {{xyz}} abc";

while ($str =~ /(\{(?:(?1)|[^{}]*+)++\})|[^{}\s]++/g) {
    print "$&\n"
}

Note that this solution assumes that the input is valid. It will behave rather randomly on invalid input. It can be modified slightly to halt when invalid input is encountered. For that, I need more details on the input format (preferably as a grammar), such as whether abc{xyz}asd is considered valid input or not.

Community
  • 1
  • 1
nhahtdh
  • 55,989
  • 15
  • 126
  • 162