AST versus synthesized attributes of non-terminal parsers

Question

I am trying to store the results of a tree parser into a recursive struct using Boost spirit x3.

I got a nice grammar, and a nice ast. But I struggle understanding how they (refuse to) connect.

Here is the AST:


#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/x3/support/ast/variant.hpp>

#include <boost/fusion/adapted/struct/adapt_struct.hpp>
#include <boost/fusion/include/adapt_struct.hpp>

#include <optional>

namespace ast
{

  struct node
  {
    // are the optional necessary here?
    std::optional<std::vector<node>> children;
    std::optional<std::string> name;
    std::optional<double> length;

  };
}

BOOST_FUSION_ADAPT_STRUCT(ast::node, children, name, length)

Then, this is the grammar that is supposed to synthesize those nodes: (I added few notes of what I believe the non-terminal parsers are synthesizing).

#include <boost/spirit/home/x3.hpp>
#include "ast.h"
#include <boost/fusion/include/vector.hpp>
#include <boost/fusion/include/std_pair.hpp>

  namespace parser
  {
    namespace x3 = boost::spirit::x3;

    // synthesize: it should be a simple ast::node, but the grammar says otherwise???
    x3::rule<class tree, ast::node> const tree{"tree"};
    x3::rule<class branch> branch{"branch"};

    // synthesize: std::string
    auto name     = x3::lexeme[x3::alpha >> *x3::alnum]; // to be improved later
    // synthesize: double
    auto length   = ':' >> x3::double_;
    // synthesize: std::optional<std::string>
    auto leaf     = -name;
    // synthesize: std::pair< std::vector<...>, std::optional<std::string> >
    auto internal = '(' >> (branch % ',') >> ')' >> -name;
    // synthesized attribute: std::variant<node, std::optional<std::string>
    auto subtree  = internal | leaf;
    // synthesize: std::pair<node, std::optional<double>>
    auto const tree_def   = x3::skip(x3::blank)[subtree >> -length >> ';' >> x3::eoi];
    // synthesize: std::pair<node, std::optional<double>>
    auto const branch_def = subtree >> -length;

    BOOST_SPIRIT_DEFINE(branch, tree);
  } // end namespace parser

  // What is that for?
  auto tree()
  {
    return parser::tree;
  }

Despite many attempts to "fix" it, it fails to compile with a /boost/spirit/home/x3/operator/detail/sequence.hpp:143:9: error: static_assert failed due to requirement 'actual_size <= expected_size' "Size of the passed attribute is bigger than expected."

My guess is that the AST is not compatible with the grammar (I don't see anything like a simple tuple<vector<node>, string, double> emerging from the grammar). If so, should I complexify the AST or find a simple way to express the grammar, or would it be a use-case of semantic action?

The "what is that for?" - that's to share rule instances across TU's without [running into SIOF](https://en.cppreference.com/w/cpp/language/siof). As long as you don't share static rule instances across TUs, you don't need it. — sehe, Nov 21 '22 at 13:43

score 2 · Accepted Answer · answered Nov 21 '22 at 16:19

First off, things are rarely a use-case for semantic actions (Boost Spirit: "Semantic actions are evil"?)¹.

Secondly, my gut says optional is likely unnecessary for the vector/string fields. In my mind, a sequence container is already a generalization of optional (where optional is like container with a maximum size of 1).

Thirdly, ~~as far as I remember there might be limitations with std::optional support.~~ Seems I'm mis-remembering from variant support Transitioning Boost Spirit parser from boost::variant to std::variant.

The Propagation Rules

Now, I'd suggest to keep two things in mind:

Always match the AST declarations as closely to your rule declarations as you can. In particular, your AST doesn't distinguish leaf nodes and internal nodes, which causes confusions because you hope to have both rules "magically" compatible with both.

Be specific about type coercions. x3::rule does this, and you can use it explicitly. One of my recurring x3 devices is as<T>[p] (or sometimes the functional version of it: as<T>(p, name)):

template <typename T> struct as_type {
    auto operator[](auto p) const {
        struct Tag {};
        return x3::rule<Tag, T>{"as"} = p;
    }
};

template <typename T> static inline as_type<T> as{};

Note that some (most?) of your comment lines

// synthesize: std::string
// synthesize: double
// synthesize: std::optional<std::string>
// synthesize: std::pair< std::vector<...>, std::optional<std::string> >
// synthesized attribute: std::variant<node, std::optional<std::string>
// synthesize: std::pair<node, std::optional<double>>
// synthesize: std::pair<node, std::optional<double>>

are not very accurate. E.g. alpha >> *alnum has fusion::deque<char, std::string> > as attribute type. Mind you, the propagation machinery may be able to iron out the subtle difference when instantiating the x3::move_to to the actual bound reference.

That actually turns out to be the case, see Simplify, Cleanup below

But as is, this ripples down: leaf isn't optional<string> now and so on.

Since you defined branch without attribute type, its attribute type is actually x3::unused_type. This means that subtree doesn't contain a vector<...> either. I had to check, so made a facility:

template <typename ExpectedAttributeType> void check(auto p) {
    static_assert(
        std::is_same_v<
            ExpectedAttributeType,
            typename x3::traits::attribute_of<decltype(p), x3::unused_type>::type>);
};

And now we can see that:

check<std::string>(name); // assuming a fixed `name` rule that coerces std::string
check<std::vector<std::string>>('(' >> name % ',' >> ')');

But for a parser that exposes unused_type:

check<x3::unused_type>(x3::omit[name]);
check<std::vector<x3::unused_type>>('(' >> x3::omit[name] % ',' >> ')'); // FAILS!
check<x3::unused_type>('(' >> x3::omit[name] % ',' >> ')'); // Passes

See what I mean with keeping close control? The ripple-effect makes for wildly different attribute types from what you expected.

In addition, pair<node, optional<double>> would not be compatible with any of your AST types (which happens to be just ast::node of course).

Show, Don't Tell

As I know you to be a fast learner, and I don't know how to best explain the transformations I'd do step by step, I'll just show you the result of applying the above principles to the question code:

x3::rule<class branch, ast::node> node{"node"};

auto name     = as<std::string>[x3::lexeme[x3::alpha >> *x3::alnum]];
auto length   = as<double>[':' >> x3::double_];
auto leaf     = as<std::string>[name | x3::attr(std::string{})];
auto children = '(' >> (node % ',') >> ')' | x3::attr(ast::nodes{});
auto node_def = children >> leaf >> -length;
auto tree     = x3::skip(x3::blank)[node >> ';' >> x3::eoi];

BOOST_SPIRIT_DEFINE(node);

You can your expectations:

void checks() {
    check<double>(length);
    check<std::string>(leaf);
    check<std::string>(name);
    check<ast::nodes>('(' >> node % ',' >> ')'); // Passes
    //check<ast::nodes>(children); // FAILS, actually variant<ast::nodes, ast::nodes> (!!)
    check<ast::nodes>(as<ast::nodes>[children]); // Trivially compatible
    check<ast::node>(tree);
}

Note how subtle the inner mechanisms can be. To be completely fair, Spirit Qi used to be way more predictable and forgiving than X3. I suspect the difference might be in favor of maintainability and compilation times?_

Testing

Borrowing some test cases from Grammar fails parsing tree with Boost Spirit X3 and debug-printing the parsed AST:

Live On Coliru prints

============ running internal tests:
"(,)"    PASS -> node { [node { [], "" }, node { [], "" }], "" }
"(A,B)F"     PASS -> node { [node { [], "A" }, node { [], "B" }], "F" }
"(A:10,B:10)F"   PASS -> node { [node { [], "A":10 }, node { [], "B":10 }], "F" }
============ running tree tests:
";"  PASS -> node { [], "" }
"(,);"   PASS -> node { [node { [], "" }, node { [], "" }], "" }
"(,,(,));"   PASS -> node { [node { [], "" }, node { [], "" }, node { [node { [], "" }, node { [], "" }], "" }], "" }
"(A,B,(C,D));"   PASS -> node { [node { [], "A" }, node { [], "B" }, node { [node { [], "C" }, node { [], "D" }], "" }], "" }
"(A,B,(C,D)E)F;"     PASS -> node { [node { [], "A" }, node { [], "B" }, node { [node { [], "C" }, node { [], "D" }], "E" }], "F" }
"(:0.1,:0.2,(:0.3,:0.4):0.5);"   PASS -> node { [node { [], "":0.1 }, node { [], "":0.2 }, node { [node { [], "":0.3 }, node { [], "":0.4 }], "":0.5 }], "" }
"(:0.1,:0.2,(:0.3,:0.4):0.5):0.0;"   PASS -> node { [node { [], "":0.1 }, node { [], "":0.2 }, node { [node { [], "":0.3 }, node { [], "":0.4 }], "":0.5 }], "":0 }
"(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);"   PASS -> node { [node { [], "A":0.1 }, node { [], "B":0.2 }, node { [node { [], "C":0.3 }, node { [], "D":0.4 }], "":0.5 }], "" }
"(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;"     PASS -> node { [node { [], "A":0.1 }, node { [], "B":0.2 }, node { [node { [], "C":0.3 }, node { [], "D":0.4 }], "E":0.5 }], "F" }
"((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A;"    PASS -> node { [node { [node { [], "B":0.2 }, node { [node { [], "C":0.3 }, node { [], "D":0.4 }], "E":0.5 }], "F":0.1 }], "A" }

Simplify, Cleanup

After all is done, I usually go in and remove unnecessary rule instantiation.

Turns out, unsurprisingly, that indeed matching the AST to your rules 1:1 makes everything trivially compatible without coercing. In your case I'd probably also remove the optional<double>, so let me show you in case you're interested:

struct node {
    nodes       children;
    std::string name;
    double      length;
};

And the rules become:

x3::rule<class branch, ast::node> node{"node"};

auto name     = x3::lexeme[x3::alpha >> *x3::alnum];
auto length   = ':' >> x3::double_ | x3::attr(0.0);
auto leaf     = name | x3::attr(std::string{});
auto children = '(' >> (node % ',') >> ')' | x3::attr(ast::nodes{});
auto node_def = children >> leaf >> -length;
auto tree     = x3::skip(x3::blank)[node >> ';' >> x3::eoi];

BOOST_SPIRIT_DEFINE(node);

For good measure let's reword the checks to only verify the expected compatibilities:

template <typename ExpectedAttributeType> void compatible(auto p) {
    static_assert(
        std::is_same_v<ExpectedAttributeType,
                       typename x3::traits::attribute_of<
                           decltype(x3::rule<struct _, ExpectedAttributeType>{} = p),
                           x3::unused_type>::type>);
};

void checks() {
    compatible<double>(length);
    compatible<std::string>(leaf);
    compatible<std::string>(name);
    compatible<ast::nodes>(children);
    compatible<ast::node>(tree);
}

Full Listing

Live On Coliru

#include <boost/fusion/include/adapted.hpp>
#include <boost/spirit/home/x3.hpp>
#include <iomanip>
#include <iostream>

namespace ast {
    using nodes = std::vector<struct node>;
    struct node {
        nodes       children;
        std::string name;
        double      length;
    };

    // for debug output
    static inline std::ostream& operator<<(std::ostream& os, node const& n) {
        os << "node { [";

        for (auto sep = ""; auto& c : n.children)
            os << std::exchange(sep, ", ") << c;

        os << "], " << quoted(n.name) ;

        if (n.length != 0)
            os << ":" << n.length;
        return os << " }";
    }
} // namespace ast

BOOST_FUSION_ADAPT_STRUCT(ast::node, children, name, length)

#include <boost/fusion/include/vector.hpp>
#include <boost/fusion/include/std_pair.hpp>

namespace parser {
    namespace x3 = boost::spirit::x3;

    static x3::rule<class branch, ast::node> const node{"node"};

    static auto const name     = x3::lexeme[x3::alpha >> *x3::alnum];
    static auto const length   = ':' >> x3::double_ | x3::attr(0.0);
    static auto const leaf     = name | x3::attr(std::string{});
    static auto const children = '(' >> (node % ',') >> ')' | x3::attr(ast::nodes{});
    static auto const node_def = children >> leaf >> -length;
    static auto const tree     = x3::skip(x3::blank)[node >> ';' >> x3::eoi];

    BOOST_SPIRIT_DEFINE(node);

    template <typename ExpectedAttributeType> void compatible(auto p) {
        static_assert(
            std::is_same_v<ExpectedAttributeType,
                           typename x3::traits::attribute_of<
                               decltype(x3::rule<struct _, ExpectedAttributeType>{} = p),
                               x3::unused_type>::type>);
    };

    void checks() {
        compatible<double>(length);
        compatible<std::string>(leaf);
        compatible<std::string>(name);
        compatible<ast::nodes>(children);
        compatible<ast::node>(tree);
    }
} // namespace parser

namespace test {
    void run_tests(auto name, auto p, std::initializer_list<char const*> cases) {
        std::cerr << "============ running " << name << " tests:\n";
        for (std::string const input : cases)
        {
            ast::node n;
            auto      ok = parse(begin(input), end(input), p, n);

            std::cout << quoted(input) << "\t " << (ok ? "PASS" : "FAIL");
            if (ok)
                std::cout << " -> " << n << std::endl;
            else
                std::cout << std::endl;
        }
    }

    void internal() {
        run_tests("internal", parser::node,
                  {
                      "(,)",
                      "(A,B)F",
                      "(A:10,B:10)F",
                  });
    }

    void tree() {
        run_tests("tree", parser::tree,
            {
                ";",
                "(,);",
                "(,,(,));",
                "(A,B,(C,D));",
                "(A,B,(C,D)E)F;",
                "(:0.1,:0.2,(:0.3,:0.4):0.5);",
                "(:0.1,:0.2,(:0.3,:0.4):0.5):0.0;",
                "(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);",
                "(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;",
                "((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A;",
            });
    }
} // namespace test

int main() {
    test::internal();
    test::tree();
}

With identical output (except for the suppressed 0.0 length in the debug output).

¹ In fairness in X3 I'll more happily consider it as writing them has become a lot more natural.

I suspect that you might want to distinguish "internal"/"leaf" nodes more than my code does. If parts of `node` is exclusive for leaf/non-leaf nodes you will obviously want to reflect that. I would observe that this means that _"I got a nice grammar, and a nice ast"_ wasn't accurate then. An AST that reflects reality would then be `node = variant` or something. You can *simplify* the AST representation to avoid the variant, but you would still have two `rule<..., ast::node>` that have different forms. Until we know what the actual grammar/semantics are, I will not assume more. — sehe, Nov 21 '22 at 16:28
Another thing I glossed over a bit is the trick of using `p | attr(default_value)` idiom instead of `-p` to get easier mapping from rule to AST in places. — sehe, Nov 21 '22 at 16:30
Wahoo @sehe, ensuring spirit customer service since 2013! :D Thank you, it helps a lot. I love the trick to test the expected compatibilities. And I loooove how everything falls into place at the end: simple and neat. I guess this is paid by the complexity of learning the library, but declaring a parsing grammar+ast in ~20 lines is amazing (and worth the sweat!). — WaterFox, Nov 21 '22 at 18:02
One last question though: I guess this is NOT the role of the AST to serve as a final data structure, right? It is meant to be compatible with the grammar, and just hold an accurate enough representation of the data before to be converted towards another construct more suitable to data treatment? — WaterFox, Nov 21 '22 at 18:09
There are no normative answers here. You can't say "AST is supposed to serve as a final data structure" nor "AST is meant to be compatible with the grammar". I tried to convey this was my preference, or methodology, of sorts. These are merely choices. My approach is based on what leads to most friction-lesss grammars (I try to sit in the sweet spot of the library). However, you might have a good reason to deviate, in which case you probably want to use `transform_attribute<>` or use semantic actions to get your bespoke data structures to work. — sehe, Nov 21 '22 at 21:46
"but declaring a parsing grammar+ast in ~20 lines is amazing (and worth the sweat!)" - yes. This is the reason Spirit is my weapon of choice for most small parsing tasks. But I would be the last person to claim it is easy or user-friendly. It's selling point is power and productivity if you know how to use it :) — sehe, Nov 21 '22 at 21:51
And that's why newcomers like me can be so grateful to have people like you teaching/showcasing this library! :clap: :clap: — WaterFox, Nov 21 '22 at 21:57