Matching arbitrary delimiters

Question

I've had good success parsing complicated and silly old text formats with Marpa before and I'm trying to do it again.

This particular format has hundreds and hundreds of different kinds of 'Begin' and 'End' blocks that look like this:

Begin BlahBlah
    asdf qwer 123
    987 xxxx
End BlahBlah

Begin FooFoo
    Begin BarBar
        some stuff (1,2,3)
    End BarBar
    whatever x
End FooFoo

How do I make a single rule that will match all of BlahBlah, BarBar, and FooFoo in the stuff above? I don't see in any examples how to dynamically capture the token and re-use it to terminate the rule, at least not with the standard scanless grammar examples. I don't want to enumerate all the different kinds of blocks because new kinds will break things, and I don't think it should be necessary.

The contents of the Begin/End blocks are immaterial to the question. In reality that stuff is a complicated mess, but nothing I don't know how to slog through. I'm hand-waving over other complicating details that make Marpa a good tool for this, such that I don't want to resort to regex.

At a bare minimum all I'm trying to achieve is a key-value map of the block type (i.e. "BlahBlah") to its contents as a string.

I'm guessing this somehow demonstrates what I need to do, but to be honest it's sailing over my head: https://github.com/jddurand/MarpaX-Languages-XML-AST/blob/master/lib/MarpaX/Languages/XML/AST/Grammar/XML_1_0.pm — rjt_jr, Feb 11 '16 at 16:25
For a start, you're going to need a "stack" for the nesting -- see perl's `push` and `pop`. The only valid `End xxx` you'll need to look for at any given time is whichever `xxx` is on top of the stack. As for "contents", is e.g. the `some stuff` to be part of the contents for FooFoo (as *well* as BarBar)? — Jeff Y, Feb 11 '16 at 17:17
Off the top of my head, there are two approaches with Marpa. One is to use an event for the "End" delimiter's tag, and manually make sure the delimiters match at parse time. This has the advantage of allowing "fast fails". However, if the delimiters are truly nested (that is, it is not valid for them to span each other), you can choose *not* to match them at parse time and instead check that they match during evaluation. This can be better, because more information is available at evaluation time, which allows better error messages. — Jeffrey Kegler, Feb 11 '16 at 18:04
I'm reasonably sure all the nesting is "true" and I can just toss the repeated token after the 'End' string. All blocks are probably sufficiently delimited by the strings 'Begin' and 'End'. I was just hoping there was a cute syntactic trick to repeating the token. I suppose I'll toss/ignore the repeat token and hope for the best rather than resort to lexeme pause events. — rjt_jr, Feb 11 '16 at 19:15
I toyed with various ideas for such tricks, but never implemented one. Note that you can do a mixed solution -- ignore the repeat token, but save it and check that it's right during evaluation -- when evaluating that rule, you'll have both together and you can issue a really cool error message describing the mismatch exactly. — Jeffrey Kegler, Feb 12 '16 at 00:44

rjt_jr · Answer 1 · 2016-02-12T18:27:58.447

This doesn't exactly answer my original question because I ultimately arrived at simply ignoring the repeated string following the "End" token. I will probably follow the comment suggestion above of simply checking that the begin/end names match in a post-processing step. Operating under the assumption that the token is redundant, this seems to work OK, as a rough first cut. Critique welcome:

#!/usr/bin/perl
use warnings;
use strict;
use v5.18;
use utf8;
use feature 'unicode_strings';
use autodie;

use Marpa::R2;
use Data::Dumper;

my $g = Marpa::R2::Scanless::G->new({
        source         => \(<<'END_OF_SOURCE'),
lexeme default = latm => 1
:default ::= action => ::array
:start ::= beginend_blocks
:discard ~ <ws>

beginend_blocks ::= beginend_block+

beginend_block ::= beginend_block_header beginend_block_contents

beginend_block_header ::= ('Begin') beginend_block_name action => ::first

beginend_block_name ::= <word> 

beginend_block_contents ::= beginend_block_content_elems (beginend_block_terminator) (<word>)

beginend_block_content_elems ::= beginend_block_content_elem+
beginend_block_content_elem ::= word            action => ::first
                              | beginend_block  action => ::first

beginend_block_terminator ::= ('End')

<word> ~ <wordchar>+
<wordchar> ~ [\S]

<ws> ~ [\s]+

END_OF_SOURCE
});


my $test_str = <<THEDATA;
Begin BlahBlah
    asdf qwer 123
    987 xxxx
End BlahBlah

Begin FooFoo
    something else
    Begin BazBaz
        some stuff (1,2,3)
    End BazBaz
    whatever x
    Begin BarBar
        some stuff (1,2,3)
    End BarBar
    whatever y 
End FooFoo
THEDATA

MAIN: {
    my $re = Marpa::R2::Scanless::R->new({ grammar => $g, trace_terminals => 0 });

    # The grammar declares no pause events, so a single read() consumes
    # the whole input and no resume loop is needed.
    $re->read(\$test_str);

    say Dumper $re->value;
}
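
Since the grammar above discards the name after 'End', the begin/end names still have to be checked somewhere. The following is only a minimal sketch of the stack idea from the comments, written in plain Perl and independent of Marpa. The helper name `check_block_names` is hypothetical, and it assumes that 'Begin' and 'End' only ever start a line when they act as delimiters; a real post-processing step would more likely walk the parse tree instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Marpa): scan the text line by line and
# verify that every "End NAME" closes the most recently opened "Begin NAME".
# Returns a list of error strings; an empty list means all names match.
sub check_block_names {
    my ($text) = @_;
    my @stack;        # open blocks, innermost last: [ name, line number ]
    my @errors;
    my $line_no = 0;

    for my $line (split /\n/, $text) {
        $line_no++;
        if ($line =~ /^\s*Begin\s+(\S+)/) {
            # Remember the name and where the block was opened.
            push @stack, [ $1, $line_no ];
        }
        elsif ($line =~ /^\s*End\s+(\S+)/) {
            my $end_name = $1;
            my $open     = pop @stack;
            if (!$open) {
                push @errors, sprintf(
                    "line %d: 'End %s' has no matching 'Begin'",
                    $line_no, $end_name);
            }
            elsif ($open->[0] ne $end_name) {
                push @errors, sprintf(
                    "line %d: 'End %s' closes 'Begin %s' opened on line %d",
                    $line_no, $end_name, $open->[0], $open->[1]);
            }
        }
    }

    # Anything still on the stack was never closed.
    for my $open (@stack) {
        push @errors, sprintf(
            "line %d: 'Begin %s' is never closed", $open->[1], $open->[0]);
    }
    return @errors;
}

# Usage: a deliberately mismatched sample (BarBar is closed as BazBaz).
my $sample = <<'SAMPLE';
Begin FooFoo
    Begin BarBar
        some stuff (1,2,3)
    End BazBaz
    whatever x
End FooFoo
SAMPLE

print "$_\n" for check_block_names($sample);
```

Because only the innermost open block is ever compared, the check reports the first mismatch at the line where it occurs, which is the kind of precise error message the comments suggest producing during evaluation.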
