Parsing white-spaces in between lexemes using boost-spirit

Question

I want to parse a BNF grammar using boost::spirit. The parser itself works fine. However, I also want to be able to read the white-space that occurs in between lexemes. For example, suppose I have a grammar like this:

<name> ::= <firstname> <surname>
<firstname> ::= <char><char> | <firstname><char>
<surname> ::= <char><char> | <surname><char>
<char>   ::= a | b | c ... | z

Suppose I have a rewriting system that uses the above grammar. Expanding <name> should then produce something like David Harvey as the output. However, if the <name> rule were written as <name> ::= <firstname><surname>, the rewriting system should produce DavidHarvey instead. This is because the rewriting system is white-space sensitive.

Ow. All this for a template expansion engine? You have been chasing "BNF parsing" for ... months now, and it turns out you need template expansion. — sehe, May 28 '21 at 21:24

score 0 · Accepted Answer · answered May 28 '21 at 21:54

Generation is a fundamentally different job than parsing.

Parsing removes redundancy and normalizes data. Generation adds redundancy and chooses (one of typically many) representations according to some goals (stylistic guides, efficiency goals etc).

By allowing yourself to get side-tracked by the similarity with BNF, you've lost sight of your goals: in BNF, many instances of whitespace are simply not significant.

This is manifest in the direct observation that the AST does not contain the whitespace.

Hacking It

The simplest way would be to represent the skipped whitespace instead as "string literals" inside your AST:

    _term       = _literal | _rule_name | _whitespace;

With

    _whitespace = +blank;

And then making the _list rule a lexeme as well (so as to not skip blanks):

    // lexemes
    qi::rule<Iterator, Ast::List()>   _list;
    qi::rule<Iterator, std::string()> _literal, _whitespace;

See it Live On Compiler Explorer
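
To see the idea without touching the Spirit grammar, here is a rough sketch in plain C++ (no Boost). It is only an illustration, not the code from the demo: `Term` and `split_terms` are made-up names, and it handles a single alternative only (no `|`, no quoted strings, no list-attribute numbers). The point is that runs of blanks end up as literal terms instead of being skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical AST node for this sketch: a term is either a rule reference
// such as <surname>, or literal text that is copied to the output verbatim.
struct Term {
    enum class Kind { RuleName, Literal };
    Kind kind;
    std::string text;
};

static bool is_blank(char c) { return c == ' ' || c == '\t'; }

// Splits one alternative into terms. Runs of blanks are kept as Literal
// terms, so the whitespace between rule references survives in the AST.
std::vector<Term> split_terms(const std::string& alt) {
    std::vector<Term> terms;
    std::size_t i = 0;
    while (i < alt.size()) {
        const std::size_t start = i;
        const std::size_t close =
            (alt[i] == '<') ? alt.find('>', i) : std::string::npos;
        if (close != std::string::npos) {
            // <rule-name>: store the name without the angle brackets
            terms.push_back({Term::Kind::RuleName,
                             alt.substr(i + 1, close - i - 1)});
            i = close + 1;
        } else if (is_blank(alt[i])) {
            // a run of blanks becomes one literal term
            while (i < alt.size() && is_blank(alt[i])) ++i;
            terms.push_back({Term::Kind::Literal, alt.substr(start, i - start)});
        } else {
            // any other text (including an unterminated '<') is a literal
            ++i;
            while (i < alt.size() && alt[i] != '<' && !is_blank(alt[i])) ++i;
            terms.push_back({Term::Kind::Literal, alt.substr(start, i - start)});
        }
        assert(i > start); // every iteration consumes input
    }
    return terms;
}
```

With this, `<firstname> <surname>` yields three terms (rule, literal `" "`, rule), while `<firstname><surname>` yields just the two rule references.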

Clean Solution

The above leaves a few "warts": there are spots where whitespace is still not significant (namely around | and specifically before the list-attribute numbers):

<code>   ::=  <letter><digit> 34 | <letter><digit><code> 23
<letter> ::= "a" 1 | "b" 2 | "c" 3 | "d" 4 | "e" 5 | "f" 6 | "g" 7 | "h" 8 | "i" 9
<digit>  ::= "9" 10 | "1" 11 | "2" 12 | "3" 13 | "4" 14

I don't see how it would usefully be significant there, unless of course your input doesn't look like the input you've been using. E.g. if it looks like this instead:

<code>::=<letter><digit>34|<letter><digit><code>23
<letter>::="a"1|"b"2|"c"3|"d"4|"e"5|"f"6|"g"7|"h"8|"i"9
<digit>::="9"10|"1"11|"2"12|"3"13|"4"14

You could make all the rules lexemes. However, this doesn't add up with the presence of quoted strings at all. The whole notion of quoted strings is to mark regions where normal whitespace (and comment) skipping is suspended.

I have a nagging feeling that you are much farther away from your actual problem (see https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) than we can even currently see, and you might even have stripped the whole quoted-string-literals concept from the "BNF" already.

A clean solution would be to forget about misleading similarities with BNF and just devise your own grammar from the ground up.

If the goal is simply to have a (recursive) macro/template expansion engine, it should really turn out a lot simpler than what you currently have. Maybe you can describe your real task (input, desired output and required behaviors) so we can help you achieve that?
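
For a rough idea of how small such an engine can be, here is a minimal sketch in plain C++ (no Spirit). It rests on assumptions about your task that I can't verify: each rule has exactly one template, and references look like `<name>`. `Templates` and `expand` are names invented for this sketch, not from any library, and the depth limit of 64 is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Assumption for this sketch: one template per rule name.
using Templates = std::map<std::string, std::string>;

// Recursively replaces every <name> that matches a known rule with the
// expansion of that rule's template. Everything else, including all
// whitespace, is copied to the output unchanged.
std::string expand(const Templates& rules, const std::string& text,
                   int depth = 0) {
    if (depth > 64) // arbitrary guard against cyclic rules
        throw std::runtime_error("expansion too deep (cyclic rule?)");

    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        const std::size_t start = i;
        const std::size_t close =
            (text[i] == '<') ? text.find('>', i) : std::string::npos;
        const auto rule = (close == std::string::npos)
            ? rules.end()
            : rules.find(text.substr(i + 1, close - i - 1));
        if (rule != rules.end()) {
            out += expand(rules, rule->second, depth + 1);
            i = close + 1;
        } else {
            out += text[i++]; // not a known reference: copy verbatim
        }
        assert(i > start); // every iteration consumes input
    }
    return out;
}
```

With `name` = `<firstname> <surname>`, `firstname` = `David` and `surname` = `Harvey`, `expand(rules, "<name>")` gives `David Harvey`; drop the blank from the `name` template and it gives `DavidHarvey`. No skipper is involved, so there is nothing to switch off.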

Added demo **[Live On Compiler Explorer](https://godbolt.org/z/734xv9jbf)** — sehe, May 28 '21 at 21:56
I do agree with you: the spaces around the pipe symbol are not of any use; the only white-spaces of use are the ones between the terms. — r360, Jun 07 '21 at 22:16
Do you have a link to a simple "(recursive) macro template expansion engine", just so I can see whether it is appropriate for my intent? — r360, Jun 07 '21 at 22:21
Mmm. I've made too many but SO search isn't the best. I'll see what I can find — sehe, Jun 07 '21 at 22:28
Here's one oldie https://stackoverflow.com/questions/9404558/compiling-a-simple-parser-with-boost-spirit/9405546#9405546 Compare with the performance-oriented adaptation here https://stackoverflow.com/a/23517664/85371 - The comments there link yet another relevant example where I wrote both Spirit and non-Spirit answer. — sehe, Jun 07 '21 at 22:45
Here's a [Mustache](https://mustache.github.io/)-based example: https://stackoverflow.com/a/24131286/85371 — sehe, Jun 07 '21 at 22:50
Thanks for this. I will see whether macro template expansion is best suited to handling whitespace. I think the quoted string is there to allow protected characters like `=, :` to be used as literals. — r360, Jun 08 '21 at 15:29
