Parsing a possibly nested braced item using a grammar

Question

I am starting to write a BibTeX parser. The first thing I would like to do is parse a braced item. A braced item could be, for example, an author field or a title, and the field may contain nested braces. The following code does not handle nested braces:

use v6;

my $str = q:to/END/;
  author={Belayneh, M. and Geiger, S. and Matth{\"{a}}i, S.K.}, 
  END

$str .= chomp;

grammar ExtractBraced {
    rule TOP {
        'author=' <braced-item> .*
    }
    rule braced-item      { '{' <-[}]>* '}' }
}

ExtractBraced.parse( $str ).say;

Output:

｢author={Belayneh, M. and Geiger, S. and Matth{\"{a}}i, S.K.},｣
 braced-item => ｢{Belayneh, M. and Geiger, S. and Matth{\"{a}｣

To make the parser accept nested braces, I would like to keep a counter of the opening braces parsed so far and decrement it whenever a closing brace is encountered. When the counter reaches zero, the complete item has been parsed.

To follow this idea, I tried to split up the braced-item regex so that a grammar action runs on each character. (The action method for the braced-item-char regex below would then maintain the brace counter.)

grammar ExtractBraced {
    rule TOP {
        'author=' <braced-item> .*
    }
    rule braced-item      { '{' <braced-item-char>* '}' }
    rule braced-item-char { <-[}]> }
}

However, the parse now fails. It is probably a silly mistake, but I cannot see why this version fails when the previous one matched.

1. My, er, rule is to always use `token` unless I *know* I want a `rule` or a `regex`. Use `token braced-item-char ...` to make progress. 2. I isolated the problem in a few seconds by adding `use Grammar::Tracer`. Have you read [my SO answer about debugging grammars](https://stackoverflow.com/a/19640657/1077672)? 3. Why not have the regex engine track the recursion levels rather than introduce manual counting? 4. Have you seen [my OTT answer to a bibtex question](https://stackoverflow.com/a/45181464/1077672)? — raiph, Nov 05 '17 at 18:01
@raiph Thanks! Using `token` instead of `rule` solved the problem. I am curious how I could have the regex engine track the recursions? I will definitely have a look at your other posts! — Håkon Hægland, Nov 05 '17 at 18:05
Remember that P6 regexes are perfectly happy with recursion. Maybe [a balanced brackets example](https://examples.perl6.org/categories/best-of-rosettacode/balanced-brackets.html) serves for inspiration? — raiph, Nov 05 '17 at 18:12
@raiph Wow, this is great! I did not think about making the rules or tokens recursive.. Thanks again. — Håkon Hægland, Nov 05 '17 at 18:16
You're welcome. Note that, in addition to P6's regex engine (NQP), some other leading regex engines support recursion, including P5's default engine and [the PCRE engine](https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions) for the last decade or so. That said, P6/nqp syntax is sweet, in stark contrast to the horrible older P5 style regex syntax, and NQP is happy producing a parse tree with thousands of recursive match objects whereas PCRE runs out of stack space if you have a lot of matches. — raiph, Nov 05 '17 at 18:38
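
A sketch of why switching from `rule` to `token` fixes the parse, as far as I understand it: a `rule` turns on `:sigspace`, so the whitespace after `<-[}]>` inside `braced-item-char` becomes an implicit `<.ws>` call. The default `ws` is `<!ww> \s*`, which refuses to match between two word characters, so `braced-item-char` fails right after the `B` of `Belayneh`. Since rules do not backtrack, `braced-item` then expects `'}'` and the whole parse fails. The grammar names `WithRule` and `WithToken` and the shortened input `$flat` are made up for this illustration.

```raku
# Hypothetical grammars for illustration; only the declarator differs.
grammar WithRule {
    rule TOP              { 'author=' <braced-item> .* }
    rule braced-item      { '{' <braced-item-char>* '}' }
    # Under :sigspace this behaves like: <-[}]> <.ws>
    rule braced-item-char { <-[}]> }
}

grammar WithToken {
    token TOP              { 'author=' <braced-item> .* }
    token braced-item      { '{' <braced-item-char>* '}' }
    # No implicit <.ws> here: exactly one non-'}' character.
    token braced-item-char { <-[}]> }
}

# Shortened input without nested braces (an assumption, to isolate the issue).
my $flat = 'author={Belayneh, M.},';

say WithRule.parse($flat).so;     # False: <.ws> fails between 'B' and 'e'
say WithToken.parse($flat).so;    # True
```

The `token` version still stops at the first `}`, so it only removes the whitespace problem; nested braces need the recursive approach.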

score 6 · Accepted Answer · answered Nov 05 '17 at 19:57

Without knowing how you want the resulting data to look, I would change it to something like this:

my $str = ｢author={Belayneh, M. and Geiger, S. and Matth{\"{a}}i, S.K.},｣;

grammar ExtractBraced {
    token TOP {
        'author='
        $<author> = <.braced-item>
        .*
    }
    token braced-item {
       '{' ~ '}'

           [
           || <- [{}] >+
           || <.before '{'> <.braced-item>
           ]*
    }
}

ExtractBraced.parse( $str ).say;

Output:

｢author={Belayneh, M. and Geiger, S. and Matth{\"{a}}i, S.K.},｣
 author => ｢{Belayneh, M. and Geiger, S. and Matth{\"{a}}i, S.K.}｣

If you want a bit more structure, it might look more like this:

my $str = ｢author={Belayneh, M. and Geiger, S. and Matth{\"{a}}i, S.K.},｣;

grammar ExtractBraced {
    token TOP {
        'author='
        $<author> = <.braced-item>
        .*
    }
    token braced-part {
        || <- [{}] >+
        || <.before '{'> <braced-item>
    }
    token braced-item {
        '{' ~ '}'
            <braced-part>*
    }
}

class Print {
    method TOP ($/){
        make $<author>.made
    }
    method braced-part ($/){
        make $<braced-item>.?made // ~$/
    }
    method braced-item ($/){
        make [~] @<braced-part>».made
    }
}


my $r = ExtractBraced.parse( $str, :actions(Print) );
say $r;
put();
say $r.made;

Output:

｢author={Belayneh, M. and Geiger, S. and Matth{\"{a}}i, S.K.},｣
 author => ｢{Belayneh, M. and Geiger, S. and Matth{\"{a}}i, S.K.}｣
  braced-part => ｢Belayneh, M. and Geiger, S. and Matth｣
  braced-part => ｢{\"{a}}｣
   braced-item => ｢{\"{a}}｣
    braced-part => ｢\"｣
    braced-part => ｢{a}｣
     braced-item => ｢{a}｣
      braced-part => ｢a｣
  braced-part => ｢i, S.K.｣

Belayneh, M. and Geiger, S. and Matth\"ai, S.K.

Note that the + on <-[{}]>+ is an optimization, as is <before '{'>. Both can be omitted and the grammar will still work.
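
To illustrate the note above, here is a minimal sketch of the first grammar with both optimizations removed. The name `MinimalBraced` is hypothetical; everything else follows the grammar shown earlier. Each non-brace character is matched one at a time, and a `{` simply makes the first alternative fail so the recursive call is tried.

```raku
# Hypothetical variant of ExtractBraced without the `+` and the <before '{'>.
grammar MinimalBraced {
    token TOP {
        'author='
        $<author> = <.braced-item>
        .*
    }
    token braced-item {
        '{' ~ '}'
            [
            || <-[{}]>          # a single character that is not a brace
            || <.braced-item>   # otherwise recurse into a nested item
            ]*
    }
}

my $str = ｢author={Belayneh, M. and Geiger, S. and Matth{\"{a}}i, S.K.},｣;

# Stringify the aliased capture; it should cover the whole outer braced item.
say ~MinimalBraced.parse($str)<author>;
```

The optimized version does the same work with fewer steps: `+` consumes runs of plain characters in one go, and the lookahead avoids entering `braced-item` when the next character cannot start one.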
