10

I need to match the values of key = value pairs in BibTeX files, which can contain arbitrarily nested braces, delimited by braces. I've got as far as matching at most two deep nested curly braces, like {some {stuff} like {this}} with the kludgey:

    token brace-value {
    '{' <-[{}]>* ['{' <-[}]>* '}' <-[{}]>* ]* '}'
    }

I shudder at the idea of going one level further down... but proper parsing of my BibTeX stuff needs at least three levels deep.

Yes, I know there are BibTeX parsers around, but I need to grab the complete entry for further processing, and peek at a few keys meanwhile. My *.bib files are rather tame (and I wouldn't mind to handle a few stray entries by hand), the problem is that I have a lot of them, with much overlap. But some of the "same" entries have different keys, or extra data. I want to consolidate them into a few master files (the whole idea behind BibTeX, right?). Not fun by hand if bibtool gives a file with no duplicates (ha!) of some 20 thousand lines...

vonbrand
  • 11,412
  • 8
  • 32
  • 52
  • See https://docs.raku.org/language/regexes#Tilde_for_nesting_structures. That looks like what you need here – michaeldel Jul 17 '20 at 03:40
  • Can you please provide a string to check? – jjmerelo Jul 17 '20 at 06:53
  • 1
    See also [How to refer to previously matched items in a grammar?](https://stackoverflow.com/q/47248329/2173773) and [Parsing a possibly nested braced item using a grammar](https://stackoverflow.com/q/47124405/2173773) – Håkon Hægland Jul 17 '20 at 07:49
  • @HåkonHægland, that handles *one* level of braces. I need at least three, – vonbrand Jul 17 '20 at 22:01
  • 2
    @jjmerelo, `{This {has {nested {braces} {inside {several {times} even} nested} again}} end}` – vonbrand Jul 17 '20 at 22:23
  • 2
    @michaeldel, `'{' ~ '}' ` is (more or less) another way to write `'{' '}'`, no further nesting. – vonbrand Jul 17 '20 at 22:25
  • @vonbrand Fwiw I'm working on a new version of my answer. I think it'll be illuminating but it's taking its sweet time to properly come together. Hopefully I'll get it in the bag tomorrow. – raiph Jul 18 '20 at 23:26

3 Answers3

5

After perusing Lenz' "Parsing with Perl 6 Regexes and Grammars" (Apress, 2017), I realized the "regex" machinery (based on backtracking) might actually be a lot more capable than officially admitted, as a regex can call another, and nowhere do I see a prohibition on recursive calls.

Before digging in, a bit of context free grammars: A way to describing nested braces (and nothing else) is with the grammar:

S -> { S } S | <nothing>

I.e., nested braces are either an opening brace, nested braces, a closing brace, more nested braces; or nothing whatsoever. This translates more or less directly to Raku (there is no empty regex, fake it by making the construction optional):

my regex nb {
   [ '{' <nb> '}' <nb> ]?
}

Lo and behold, this works. Need to fix up to avoid captures, kill backtracking (if it doesn't match on the first try, it won't ever match), and decorate with "anything else" fillers.

my regex nested-braces {
    :ratchet 
     <-[{}]>*
     [ '{' <.nested-braces> '}' <.nested-braces> ]?
     <-[{}]>*
};

This checks out with my test cases.

For not-so-adventurous souls, there is the Text::Balanced module for Perl (formerly Perl 5, callable from Raku using Inline::Perl5). Not directly useful to me inside a grammar, unfortunately.

vonbrand
  • 11,412
  • 8
  • 32
  • 52
  • 3
    To make it avoid backtracking, you generally just need to call it a `token` instead of a `regex` (or use the modifier `:r` [for :ratcheting] somewhere in the regex) – user0721090601 Jul 17 '20 at 14:04
  • 1
    @user0721090601, yes, tthat was what I had in mind. Just was laying out proof-of-concept. Final version in now. – vonbrand Jul 17 '20 at 14:34
  • 2
    "context free grammar" The **What *are* Raku Rules / grammars?** section of [my answer to SO **What's the real difference between a token and a rule?**](https://stackoverflow.com/a/62053666/1077672) may be of interest. – raiph Jul 18 '20 at 00:27
4

Solution

A way to describe nested braces (and nothing else)

Presuming a rule named &R, I'd likely write the following pattern if I was writing a quick small one-off script:

\{ <&R>* \} 

If I was writing a larger program that should be maintainable I'd likely be writing a grammar and, using a rule named R the pattern would be:

'{' ~ '}' <R>*

This latter avoids leaning toothpick syndrome and uses the regex ~ operator.

These will both parse arbitrarily deeply nested paired braces, eg:

say '{{{{}}}}' ~~ token { \{ <&?ROUTINE>* \} } # 「{{{{}}}}」

(&?ROUTINE refers to the routine in which it appears. A regex is a routine. (Though you can't use <&?ROUTINE> in a regex declared with / ... / syntax.)

regex vs token

kill backtracking

my regex nested-braces {
    :ratchet 

The only difference between patterns declared with regex and token is that the former turns ratcheting off. So using it and then immediately turning ratcheting on is notably unidiomatic. Instead:

my token nested-braces {

Backtracking

the "regex" machinery (based on backtracking)

The grammar/regex engine does include backtracking as an optional feature because that's occasionally exactly what one wants.

But the engine is not "based on backtracking", and many grammars/parsers make little or no use of backtracking.

Recursion

a regex can call another, and nowhere do I see a prohibition on recursive calls.

This alone is nothing special for contemporary regex engines.

PCRE has supported recursion since 2000, and named regexes since 2003. Perl's default regex engine has supported both since 2007.

Their support for deeper levels of recursion and more named regexes being stored at once has been increasing over time.

Damian Conway's PPR uses these features of regexes to build non-trivial (but still small) parse trees.

Capabilities

a lot more capable

Raku "regexes" can be viewed as a cleaned up take on the unfolding regex evolution. To the degree this helps someone understand them, great.

But really, it's a whole new deal. For example, they're turing complete, in a sensible way, and thus able to parse anything.

than officially admitted

Well that's an odd thing to say! Raku's Grammars are frequently touted as one of Raku's most innovative features.

There are three major caveats:

  • Performance The primary current caveat is that a well written C parser will blow the socks off a well written Raku Grammar based parser.

  • Pay off It's often not worth the effort it takes to write a fully correct parser for a non-trivial format if there's an existing parser.

  • Left recursion Raku does not automatically rewrite left recursion (infinite loops).

Using existing parsers

I know there are BibTeX parsers around, but I need to grab the complete entry for further processing, and peek at a few keys meanwhile.

Using a foreign module in Raku can be a bit of a revelation. It is not necessarily like anything you'll have experienced before. Raku's foreign language adaptors can do smart marshaling for you so it can be like you're using native Raku features.

Two of the available foreign language adaptors are already sufficiently polished to be amazing -- the ones for Perl and for C.

I'm pretty sure there's a BibTeX package for Perl that wraps a C BibTeX parser. If you used that you'd hopefully get parsing results all nicely wrapped up into Raku objects as if it was all Raku in the first place, but retaining much of the high performance of the C code.

A Raku BibTeX Grammar?

Perhaps your needs do call for creating and using a small Raku Grammar.

(Maybe you're doing this partly as an exercise to familiarize yourself with Raku, or the regex/grammar aspect of Raku. For that it sounds pretty ideal.)

As soon as you begin to use multiple regexes together -- even just two -- you are closing in on grammar territory. After all, they're just an easy-to-use construct for using multiple regexes together.

So if you decide you want to stick with writing parsing code in Raku, expect to write it something like this:

grammar BiBTeX {
  token TOP { ... }
  token ...
  token ...
}
BiBTeX.parse: my-bib-file

For more details, see the official doc's Grammar tutorial or read Moritz's book.

jjmerelo
  • 22,578
  • 8
  • 40
  • 86
raiph
  • 31,607
  • 3
  • 62
  • 111
  • 1
    "Though you can't use <&?ROUTINE> in a /…/" … you can't use <&ROUTINE> but there *is* a syntax, but I'll be damned if I remember off hand. Someone posted it on something here though. i think it was <~~> – user0721090601 Jul 18 '20 at 05:48
  • 1
    Yup. Just double checked. `/'{' ~ '}' <~~>*/` will match any random sequence of balanced `{{{{{ }}}}}` – user0721090601 Jul 18 '20 at 05:53
  • hey @user0721090601 - that’s cool - how does it look with capturing the contents and maybe actions?* – librasteve Jul 18 '20 at 20:26
  • @p6steve Only `<...>` assertions that start with an alpha, capture. So `<~~>` won't separately capture recursive calls. (I just tested to confirm.) So for independently capturing recursive calls you'd need to use a rule in a method and explicitly name the rule in a recursive call. I've also tested actions, and they are called once for each time a particular call successfully completes its matching, as one would expect. – raiph Jul 18 '20 at 23:11
  • 1
    @raiph that's true, *but* it's fairly easy to simply add in `$=<~~>` which turns into an easily accessibly named capture. – user0721090601 Jul 19 '20 at 00:52
  • Sorry, but the Raku grammar documentation (and much else) is intelligible only if you know exactly what they are talking about. No decent examples, nothing. – vonbrand Jul 19 '20 at 03:02
  • @user0721090601, all I get is "unrecognized regex character ~" – vonbrand Jul 19 '20 at 03:07
  • thanks @user0721090601 and raiph ... that’s the incantation i was hoping for. – librasteve Jul 19 '20 at 09:23
  • @vonbrand you make a good point about the docs BUT i really like the current terse and searchable doc site as a concise tool for easy reference - however, i do think that the explication of Grammars for new users could use some worked examples - fwiw there is a lot of material in the raku advent posts... – librasteve Jul 19 '20 at 09:23
  • @vonbrand I didn't mean to suggest it was documented. I looked for it and didn't find it. It works for me. "unrecognized regex character ~" suggests you didn't write the four characters `<~~>`. – raiph Jul 19 '20 at 09:25
0

OK, just (re) checked. The documentation of '{' ~ '}' leaves a whole lot to desire, it is not at all clear it is meant to handle balanced, correctly nested delimiters.

So my final solution is really just along the lines:

my regex nested-braces {
   :ratchet
   '{' ~ '}' .*
}

Thanks everyone! Lerned quite a bit today.

vonbrand
  • 11,412
  • 8
  • 32
  • 52
  • `my regex nested-braces { :ratchet '{' ~ '}' .* }` will match a *single* pair, without regard for nesting/balance. For example, matched against `'{}}}}}nnn}'` it returns `'「{}}}}}nnn}」'`. – raiph Jul 19 '20 at 22:46
  • "it is not at all clear [`'{' ~ '}'`] is meant to handle balanced, correctly nested delimiters". It isn't meant to. It's for matching a *single* pair. The doc section title ("Tilde for nesting structures") is unfortunate. The doc section body is spot on -- but it's [verbiage Larry wrote for compiler implementors](https://design.raku.org/S05.html#line_879), not users. The "nested" in "nested subrules" is about nested subrules matching what goes between a single occurrence of a pair of delimiters. – raiph Jul 19 '20 at 22:50
  • 1
    The reason `~` exists in the regex syntax is to put the delimiters syntactically near each other, instead of on opposite sides of the middle. That is the only thing it does. `'{' ~ '}' .*` works exactly the same as `'{' .* '}'`. – Brad Gilbert Jul 21 '20 at 00:36