6

I have an expression which I need to split and store in an array:

aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"

It should look like this once split and stored in the array:

aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }
aaa="bbb{}" { aa="b}b" }
aaa="bbb,ccc"

I am using Perl 5.8. How can I do this?

brian d foy
meharo

6 Answers

11

Use the Perl module Regexp::Common. It provides a regex for balanced delimiters that works well.

# ASN.1
use Regexp::Common;
my $bp    = $RE{balanced}{-parens=>'{}'};   # regex for one balanced { ... } group
my @genes = $l =~ /($bp)/g;                 # every balanced group found in $l (the input string)
Erik Aronesty
10

There's an example in perlre, using the recursive regex features introduced in v5.10. Although you are limited to v5.8, other people coming to this question should get the right solution :)

$re = qr{ 
            (                                # paren group 1 (full function)
                foo
                (                            # paren group 2 (parens)
                    \(
                        (                    # paren group 3 (contents of parens)
                            (?:
                                (?> [^()]+ ) # Non-parens without backtracking
                                |
                                (?2)         # Recurse to start of paren group 2
                            )*
                        )
                    \)
                )
            )
    }x;
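
For readers on Perl 5.10 or later, the same recursion idea can be adapted to the braces and quoted strings in the question. This is only a sketch and is not taken from perlre: the group names `item` and `braces` are made up for illustration, and it assumes quoted strings never contain an escaped quote. It will not run on 5.8.

```perl
use strict;
use warnings;
use 5.010;   # (?&name) and (?(DEFINE)...) need Perl 5.10 or later

my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, '
       . 'aaa="bbb{}" { aa="b}b" }, '
       . 'aaa="bbb,ccc"';

my $item_re = qr/
    (?<item>                        # one top-level item
        (?:
            " [^"]* "               # a whole quoted string
          |
            (?&braces)              # a whole balanced { ... } block
          |
            [^{}",]                 # any other character except a comma
        )+
    )
    (?(DEFINE)
        (?<braces>
            \{
            (?:
                (?> [^{}"]+ | " [^"]* " )   # plain text or a quoted string
              |
                (?&braces)                  # recurse for a nested block
            )*
            \}
        )
    )
/x;

my @out;
while ($in =~ /$item_re/g) {
    my $item = $+{item};
    $item =~ s/^\s+|\s+$//g;        # trim the space left after each comma
    push @out, $item;
}

print "$_\n" for @out;
```

Because a comma can only be consumed inside a quoted string or a balanced block, each match stops at the next top-level comma, which is what produces the split.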
brian d foy
1

The proposed solutions will not work if you need to match balanced parentheses or curly brackets and also take backslash-escaped ones into account. Instead, you would write something like this (building on the suggested solution in perlre):

$re = qr/
(                                                # paren group 1 (full function)
    foo
    (?<paren_group>                              # paren group 2 (parens)
        \(
            (                                    # paren group 3 (contents of parens)
                (?:
                    (?> (?:\\[()]|(?![()]).)+ )  # escaped parens or no parens
                    |
                    (?&paren_group)              # Recurse to named capture group
                )*
            )
        \)
    )
)
/x;
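
A small usage sketch may make the effect of the escaped-paren branch clearer. The pattern is repeated so that the example is self-contained; the input string is made up for illustration, and Perl 5.10 or later is assumed.

```perl
use strict;
use warnings;
use 5.010;   # named groups and (?&name) recursion need Perl 5.10 or later

my $re = qr/
(                                                # group 1: the whole call
    foo
    (?<paren_group>                              # group 2: the parens
        \(
            (                                    # group 3: contents of parens
                (?:
                    (?> (?:\\[()]|(?![()]).)+ )  # escaped parens or no parens
                    |
                    (?&paren_group)              # recurse to named capture group
                )*
            )
        \)
    )
)
/x;

# Made-up input: the \( is escaped and must not count as an opening paren.
my $text = 'x = foo(a \( b (c) d) + 1';

my ($call, $contents);
if ($text =~ $re) {
    ($call, $contents) = ($1, $3);
    print "call:     $call\n";
    print "contents: $contents\n";
}
```

Here the escaped `\(` is consumed as ordinary text, so the match ends at the parenthesis that really closes `foo(`.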
Jacques
1

I agree with Scott Rippey, more or less, about writing your own parser. Here's a simple one:

my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, ' .
         'aaa="bbb{}" { aa="b}b" }, ' .
         'aaa="bbb,ccc"'
;

my @out = ('');

my $nesting = 0;
while($in !~ m/\G$/cg)
{
  if($nesting == 0 && $in =~ m/\G,\s*/cg)
  {
    push @out, '';
    next;
  }
  if($in =~ m/\G(\{+)/cg)
    { $nesting += length $1; }
  elsif($in =~ m/\G(\}+)/cg)
  {
    $nesting -= length $1;
    die if $nesting < 0;
  }
  elsif($in =~ m/\G((?:[^{}"]|"[^"]*")+)/cg)
    { }
  else
    { die; }
  $out[-1] .= $1;
}

(Tested in Perl 5.10; sorry, I don't have Perl 5.8 handy, but so far as I know there aren't any relevant differences.) Needless to say, you'll want to replace the dies with something application-specific. And you'll likely have to tweak the above to handle cases not included in your example. (For example, can quoted strings contain \"? Can ' be used instead of "? This code doesn't handle either of those possibilities.)

ruakh
  • I'm glad to know a Perl-speaker agrees with my answer ... I only speak PCRE, so my answer made the bold assumption that a parser would be easier than the possibly impossible Regex. – Scott Rippey Nov 02 '11 at 07:11
  • I don't see anything here that would prevent it from working the same on Perl5 version 8 – Brad Gilbert Nov 02 '11 at 16:55
0

Try something like this:

use strict;
use warnings;
use Data::Dumper;

my $exp=<<END;
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }     , aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
END

chomp $exp;
# Split on a comma that directly follows a closing brace; the lookbehind keeps the brace in the element.
my @arr = split /(?<=\})\s*,\s*/, $exp;
print Dumper(\@arr);
Reza S
  • Thank you for the response. I found it is breaking when matching something like `aa="bb},cc"`. – meharo Nov 02 '11 at 19:08
-1

Although recursive regular expressions can usually be used to capture "balanced braces" {}, they won't work for you, because you also need to match "balanced quotes" ".
This would be a very tricky task for a Perl regular expression, and I'm fairly certain it's not possible. (In contrast, it could probably be done with Microsoft's "balancing groups" regex feature.)

I would suggest creating your own parser. As you process each character, you count each " and {}, and only split on , if they are "balanced".
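
A minimal sketch of that counting approach, written with nothing newer than Perl 5.8 features. The name `split_top_level` is hypothetical, and the code assumes quoted strings use only `"` and never contain an escaped quote.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text on commas that are outside both
# "..." strings and { ... } blocks.
sub split_top_level {
    my ($text) = @_;
    my @parts    = ('');
    my $depth    = 0;   # current { } nesting depth
    my $in_quote = 0;   # true while inside a "..." string

    for my $ch (split //, $text) {
        if ($ch eq '"') {
            $in_quote = !$in_quote;
        }
        elsif (!$in_quote) {
            if ($ch eq '{') {
                $depth++;
            }
            elsif ($ch eq '}') {
                $depth--;
                die "unbalanced '}'\n" if $depth < 0;
            }
            elsif ($ch eq ',' && $depth == 0) {
                push @parts, '';    # top-level comma: start a new part
                next;               # and drop the comma itself
            }
        }
        $parts[-1] .= $ch;
    }
    die "unbalanced input\n" if $depth != 0 || $in_quote;

    s/^\s+//, s/\s+$// for @parts;  # trim whitespace around each part
    return @parts;
}

my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, '
       . 'aaa="bbb{}" { aa="b}b" }, '
       . 'aaa="bbb,ccc"';

my @out = split_top_level($in);
print "$_\n" for @out;
```

Braces are ignored while inside a quoted string, so a value such as `aa="b}b"` does not disturb the nesting count.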

Community
Scott Rippey
  • I think it can be done in Perl, just not easily. Especially to a newer Perl programmer. Although it may be easier with [Regexp::Grammars](http://search.cpan.org/perldoc/Regexp::Grammars) style regular expressions. Using a **real** parser will work better, [Marpa](http://search.cpan.org/dist/Marpa/) perhaps. – Brad Gilbert Nov 02 '11 at 16:34
  • [Regexp::Grammars](http://search.cpan.org/perldoc/Regexp::Grammars) is not supported by 5.8 :( – meharo Nov 02 '11 at 19:24
  • It's very possible, but not something I'd recommend. :) – brian d foy Oct 08 '13 at 03:36