How to handle multi-line rules for gor parsing bnf grammar using boost spirit qi

Question

Assuming I have a BNF grammar like this

<code>   ::=  <letter><digit> | <letter><digit><code>
<letter> ::= a | b | c | d | e
             | f | g | h | i
<digit>  ::= 0 | 1 | 2 | 3 |
             4

If you look at the <letter> rule, its continuation starts with the | but that of the <digit> rule starts with the production with | appearing at the end of the previous line. I also don't want to use a particular symbol to represent the end of a rule.

How do check if a rule as ended using the Boost Spirit Qi for implementation.

I have just gone through the tutorial on the boost page and wondering how I am going to handle this.

Are you saying you want to parse BNF with Spirit? Or translating a BNF grammar into Qi rules? If the former, the real question is "How does BNF signal the end of a production". Once you know, I can tell you how to _check_ that in Qi — sehe, Jul 18 '20 at 14:24

sehe · Accepted Answer · 2020-07-18T22:22:50.107

Wikipedia

BNF syntax can only represent a rule in one line, whereas in EBNF a terminating character, the semicolon character “;” marks the end of a rule.

So the simple answer is: the input isn't BNF.

Iff you want to support it anyways (at your own peril :)) you'll have to make it so. So, let's write a simplistic BFN grammar, literally mapping from Wikipedia BNF

<syntax>         ::= <rule> | <rule> <syntax>
<rule>           ::= <opt-whitespace> "<" <rule-name> ">" <opt-whitespace> "::=" <opt-whitespace> <expression> <line-end>
<opt-whitespace> ::= " " <opt-whitespace> | ""
<expression>     ::= <list> | <list> <opt-whitespace> "|" <opt-whitespace> <expression>
<line-end>       ::= <opt-whitespace> <EOL> | <line-end> <line-end>
<list>           ::= <term> | <term> <opt-whitespace> <list>
<term>           ::= <literal> | "<" <rule-name> ">"
<literal>        ::= '"' <text1> '"' | "'" <text2> "'"
<text1>          ::= "" | <character1> <text1>
<text2>          ::= '' | <character2> <text2>
<character>      ::= <letter> | <digit> | <symbol>
<letter>         ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
<digit>          ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<symbol>         ::=  "|" | " " | "!" | "#" | "$" | "%" | "&" | "(" | ")" | "*" | "+" | "," | "-" | "." | "/" | ":" | ";" | ">" | "=" | "<" | "?" | "@" | "[" | "\" | "]" | "^" | "_" | "`" | "{" | "}" | "~"
<character1>     ::= <character> | "'"
<character2>     ::= <character> | '"'
<rule-name>      ::= <letter> | <rule-name> <rule-char>
<rule-char>      ::= <letter> | <digit> | "-"

It could look like this:

template <typename Iterator>
struct BNF: qi::grammar<Iterator, Ast::Syntax()> {
    BNF(): BNF::base_type(start) {
        using namespace qi;
        start = skip(blank) [ _rule % +eol ];

        _rule       = _rule_name >> "::=" >> _expression;
        _expression = _list % '|';
        _list       = +_term;
        _term       = _literal | _rule_name;
        _literal    = '"' >> *(_character - '"') >> '"'
                    | "'" >> *(_character - "'") >> "'";
        _character  = alnum | char_("\"'| !#$%&()*+,./:;>=<?@]\\^_`{}~[-");
        _rule_name  = '<' >> (alpha >> *(alnum | char_('-'))) >> '>';

        BOOST_SPIRIT_DEBUG_NODES(
            (_rule)(_expression)(_list)(_term)
            (_literal)(_character)
            (_rule_name))
    }

  private:
    qi::rule<Iterator, Ast::Syntax()>     start;
    qi::rule<Iterator, Ast::Rule(),       qi::blank_type> _rule;
    qi::rule<Iterator, Ast::Expression(), qi::blank_type> _expression;
    qi::rule<Iterator, Ast::List(),       qi::blank_type> _list;
    // lexemes
    qi::rule<Iterator, Ast::Term()>       _term;
    qi::rule<Iterator, Ast::Name()>       _rule_name;
    qi::rule<Iterator, std::string()>     _literal;
    qi::rule<Iterator, char()>            _character;
};

Now it will parse your sample (corrected to be BNF):

    std::string const input = R"(<code>   ::=  <letter><digit> | <letter><digit><code>
<letter> ::= "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i"
<digit>  ::= "0" | "1" | "2" | "3" | "4"
    )";

Live On Compiler Explorer

Prints:

code ::= {<letter>, <digit>} | {<letter>, <digit>, <code>}
letter ::= {a} | {b} | {c} | {d} | {e} | {f} | {g} | {h} | {i}
digit ::= {0} | {1} | {2} | {3} | {4}
Remaining: "
    "

Support Line-Wrapped Rules

The best way is to not accept them - since the grammar wasn't designed for it unlike e.g. EBNF.

You can force the issue by doing a negative look-ahead in the skipper:

_skipper = blank | (eol >> !_rule);
start = skip(_skipper) [ _rule % +eol ];

For technical reasons (Boost spirit skipper issues) that doesn't compile, so we need to feed it a placeholder skipper inside the look-ahead:

_blank = blank;
_skipper = blank | (eol >> !skip(_blank.alias()) [ _rule ]);
start = skip(_skipper.alias()) [ _rule % +eol ];

Now it parses the same but with various line-breaks:

    std::string const input = R"(<code>   ::=  <letter><digit> | <letter><digit><code>
<letter> ::= "a" | "b" | "c" | "d" | "e"
           | "f" | "g" | "h" | "i"
<digit>  ::= "0" | "1" | "2" | "3" |
             "4"
    )";

Printing:

code ::= {<letter>, <digit>} | {<letter>, <digit>, <code>}
letter ::= {a} | {b} | {c} | {d} | {e} | {f} | {g} | {h} | {i}   
digit ::= {0} | {1} | {2} | {3} | {4}

FULL LISTING

Compiler Explorer

//#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted.hpp>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
#include <iomanip>
namespace qi = boost::spirit::qi;

namespace Ast {
    struct Name : std::string {
        using std::string::string;
        using std::string::operator=;

        friend std::ostream& operator<<(std::ostream& os, Name const& n) {
            return os << '<' << n.c_str() << '>';
        }
    };

    using Term = boost::variant<Name, std::string>;
    using List = std::list<Term>;
    using Expression = std::list<List>;

    struct Rule {
        Name name; // lhs
        Expression rhs;
    };

    using Syntax = std::list<Rule>;
}

BOOST_FUSION_ADAPT_STRUCT(Ast::Rule, name, rhs)

namespace Parser {
    template <typename Iterator>
    struct BNF: qi::grammar<Iterator, Ast::Syntax()> {
        BNF(): BNF::base_type(start) {
            using namespace qi;
            _blank = blank;
            _skipper = blank | (eol >> !skip(_blank.alias()) [ _rule ]);
            start = skip(_skipper.alias()) [ _rule % +eol ];

            _rule       = _rule_name >> "::=" >> _expression;
            _expression = _list % '|';
            _list       = +_term;
            _term       = _literal | _rule_name;
            _literal    = '"' >> *(_character - '"') >> '"'
                        | "'" >> *(_character - "'") >> "'";
            _character  = alnum | char_("\"'| !#$%&()*+,./:;>=<?@]\\^_`{}~[-");
            _rule_name  = '<' >> (alpha >> *(alnum | char_('-'))) >> '>';

            BOOST_SPIRIT_DEBUG_NODES(
                (_rule)(_expression)(_list)(_term)
                (_literal)(_character)
                (_rule_name))
        }

      private:
        using Skipper = qi::rule<Iterator>;
        Skipper _skipper, _blank;

        qi::rule<Iterator, Ast::Syntax()>     start;
        qi::rule<Iterator, Ast::Rule(),       Skipper> _rule;
        qi::rule<Iterator, Ast::Expression(), Skipper> _expression;
        qi::rule<Iterator, Ast::List(),       Skipper> _list;
        // lexemes
        qi::rule<Iterator, Ast::Term()>       _term;
        qi::rule<Iterator, Ast::Name()>       _rule_name;
        qi::rule<Iterator, std::string()>     _literal;
        qi::rule<Iterator, char()>            _character;
    };
}

int main() {
    Parser::BNF<std::string::const_iterator> const parser;

    std::string const input = R"(<code>   ::=  <letter><digit> | <letter><digit><code>
<letter> ::= "a" | "b" | "c" | "d" | "e"
           | "f" | "g" | "h" | "i"
<digit>  ::= "0" | "1" | "2" | "3" |
             "4"
    )";

    auto it = input.begin(), itEnd = input.end();

    Ast::Syntax syntax;
    if (parse(it, itEnd, parser, syntax)) {
        for (auto& rule : syntax)
            fmt::print("{} ::= {}\n", rule.name, fmt::join(rule.rhs, " | "));
    } else {
        std::cout << "Failed\n";
    }

    if (it != itEnd)
        std::cout << "Remaining: " << std::quoted(std::string(it, itEnd)) << "\n";
}

Also Live On Coliru (without libfmt)

Actually **[Live On Coliru](http://coliru.stacked-crooked.com/a/786e922328a6dd35)** (simplified output) — sehe, Jul 18 '20 at 16:27
Fixed the [Compiler Explorer](https://godbolt.org/z/6cvKqM) link. /HT @vitaut — sehe, Jul 18 '20 at 22:25
just a quick one so when exactly does the Spirit Lexer become useful. — user2987773, Jul 19 '20 at 12:53
also how do you get double quotes (") to be read if they form part or is a production. Say ` ::= " | : | '` — user2987773, Jul 19 '20 at 13:49
Yeah. It seems that the syntax I copied from Wikipedia (see answer) contradicts [_"The BNF uses the symbols (<, >, |, ::=) for itself, but does not include quotes around terminal strings. This prevents these characters from being used in the languages, and requires a special symbol for the empty string. In EBNF, terminals are strictly enclosed within quotation marks ("…" or '…')"_](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form#Advantages_over_BNF). — sehe, Jul 19 '20 at 14:07
To be honest, you still have to tell us what you are trying to achieve ([asked 23h ago](https://stackoverflow.com/questions/62967856/xx/62970956?noredirect=1#comment111354512_62967856)). When we know, we might help you decide what you want. — sehe, Jul 19 '20 at 14:07
Here's a "random" imagined tweak that would allow for a rule like you showed: https://godbolt.org/z/zq6GE1 (may require a trip to [here for `hold[]`](https://stackoverflow.com/questions/13869978/boostspiritqi-duplicate-parsing-on-the-output/13875183#13875183)) — sehe, Jul 19 '20 at 14:30
About lexers: [probably never](https://stackoverflow.com/questions/47191531/cannot-get-boostspirit-parserlexer-working-for-token-types-other-than-stdst/47232006#47232006). See [many other examples](https://stackoverflow.com/search?q=user%3A85371+lexer) — sehe, Jul 19 '20 at 14:34
what's the difference between `skip(blank) [ _rule % +eol ]` and `(eol >> !skip(_blank.alias()) [ _rule ])`. Is the latter responsible for handling the multi-lines. — user2987773, Jul 20 '20 at 15:53
Huh. You seem to comparing two things that aren't exchangeable. the `!a` negative look-ahead assertion makes sure that that immediately following the newline there doesn't seem to be a `_rule`. If so, the newline can be skipped. — sehe, Jul 20 '20 at 16:34
what will be your advice if I want to support escape characters in this grammar (newline && tab especially). I saw this [post](http://boost-spirit.com/home/articles/karma-examples/generate-escaped-string-output-using-spirit-karma/). Trying to integrate it hasn't worked. — user2987773, Jul 24 '20 at 17:17
Also, reading from file doesn't seem to be working properly for the double and single quoted rules as they are read as part of the characters they enclose. — user2987773, Jul 24 '20 at 19:00
@user2987773 That link points to Karma. That's literally the opposite of Qi. You can look here: https://stackoverflow.com/questions/61695235/creating-a-boostspiritx3-parser-for-quoted-strings-with-escape-sequence-hand/61696706#61696706 - or any of the other Qi answers that mention escapes — sehe, Jul 24 '20 at 21:08
"Also, reading from file doesn't seem to be working properly for the double and single quoted rules as they are read as part of the characters they enclose" - I have no idea what you mean. What is read as part of "the characters they enclose"? — sehe, Jul 24 '20 at 21:10
what I meant was for example `"w"` gets read as it is, instead of minus the double quotes. — user2987773, Jul 24 '20 at 22:38
I'm very confused, the [live demo](https://godbolt.org/z/6cvKqM) clearly shows that not being the case. Of course, [when I tweaked it _on your request_](https://stackoverflow.com/questions/62967856/how-to-handle-multi-line-rules-for-gor-parsing-bnf-grammar-using-boost-spirit-qi/62970956?noredirect=1#comment111375060_62970956) the input needed to be changed *as* well. It is ***really*** time for you to make up your mind what it is exactly that you want to do, instead of getting clogged up on he "how". We're not seeing the forest for the trees because you're only drawing scattered branches. — sehe, Jul 24 '20 at 23:18

How to handle multi-line rules for gor parsing bnf grammar using boost spirit qi

1 Answers1

Support Line-Wrapped Rules

FULL LISTING

Linked