
I'm new to C++ and using Boost Spirit for the first time; I took on this task so my team can learn and work with C++ (I come from a web development background :)). Searching the internet, I found some great examples from this community (especially from Sehe), but I can't quite piece everything together because of the complexity of the XML structure.

This parser will act as the middleman that translates structure definitions (written by other teams) into XML, so that multiple integration teams can generate code from it in the language of their choice, based on the XML structure.

Below is a small example of the structure definition text (read from an external file). The file could be very large, depending on the task.

Class Simple caption;
Class Simple columns "Column Name";

Class Container CONTAINER_NAME ( 
  Complex OBJECT_NAME ( 
    Simple obj_id 
    Simple obj_property1
    Simple obj_attribute enumeration(EnumOption1, EnumOption2,EnumOption3,EnumOption4)
    Container OBJECT_ITEMS (
      Complex OBJECT_ITEM (
        Simple obj_item_name
        Container set_value (
          Simple obj_item_value
        )
      )
    )
  )
);

The parser will evaluate and produce XML in this format

<task>
  <class>
    <simple>
      <identifier>caption</identifier>
      <literal>" "</literal>
    </simple>
  </class>
  <class>
    <simple>
      <identifier>caption</identifier>
      <literal>"Column Name"</literal>
    </simple>
  </class>
  <class>
    <container>
      <identifier>CONTAINER_NAME:CONTAINER_NAME</identifier>
      <literal>" "</literal>
      <complex>
        <identifier>CONTAINER_NAME:OBJECT_NAME</identifier>
        <literal>" "</literal>
        <simple>
          <identifier>CONTAINER_NAME:obj_id</identifier>
          <literal>" "</literal>
        </simple>
        <simple>
          <identifier>CONTAINER_NAME:obj_property1</identifier>
          <literal>" "</literal>
        </simple>
        <simple>
          <identifier>CONTAINER_NAME:obj_attribute</identifier>
          <literal>" "</literal>
          <enumeration>
            <word>EnumOption1</word>
            <word>EnumOption2</word>
            <word>EnumOption3</word>
            <word>EnumOption4</word>
          </enumeration>
        </simple>
        <container>
          <identifier>CONTAINER_NAME:OBJECT_ITEMS</identifier>
          <literal>" "</literal>
          <complex>
            <identifier>CONTAINER_NAME:OBJECT_ITEM</identifier>
            <literal>" "</literal>
            <simple>
              <identifier>CONTAINER_NAME:obj_item_name</identifier>
              <literal>" "</literal>
            </simple>
            <container>
              <identifier>CONTAINER_NAME:set_value</identifier>
              <literal>" "</literal>
              <simple>
                <identifier>CONTAINER_NAME:obj_item_value</identifier>
                <literal>" "</literal>
              </simple>
            </container>
          </complex>
        </container>
      </complex>
    </container>
  </class>
</task>

From what I've read, I think I will need the following (just my thought process, with very basic knowledge of this):

  1. A grammar definition with rules for Class, Container, Complex and Simple to parse the code definition text (my biggest challenge);
  2. Some kind of semantic actions/functions to create an XML node for each group (Simple, Complex, Container, Class, etc.). I see that I could use msxml6.dll as the XML generator, but I can't figure out how to hook it in.

I saw a few examples that construct an AST and then build XML from it, but the XML structure used here does not quite follow any standard, since a Container can hold a Complex, and a Complex can in turn hold a Container.

Any help, instructions or examples pointing me to where to begin would be greatly appreciated.

UPDATED

  1. A semicolon marks the end of a Class block.
  2. Comments exist, but always on a separate line. There are no inline comments.
  3. There is no literal tag in the code definition; the literal content is whatever appears inside double quotes. See line 2 of the updated code definition block.
Dylan
  • What is the input grammar, exactly? Your example is way too sloppy: are `//` comments included? Why do we often have `;` but often not? Is the enumeration part of the simple? It looks like that, but I'd expect at most a "literal" there. Why does the XML not match the sample input? It's impossible for me to figure out the intended outcome. – sehe May 31 '23 at 23:17
  • Also, consider rolling a Perl script or something instead :) – sehe May 31 '23 at 23:18
  • The semicolon indicates the end of a class block. There are no comments. The one you see in the example is there to let you know there could be a lot more Simples in this class. I did check the XML and it is correct. I'm thinking of compiling this parser as a DLL so that it can be called from other languages too. – Dylan Jun 01 '23 at 05:30

2 Answers


Okay, the explanations helped me realize the correspondence between the input and the XML. There's still a number of ... unclear specs, but let's roll with it.

Parsing


  1. AST

    As always, I start out with the AST. This time instead of basing it on the sample input, it was easier to base it on the output XML:

    namespace Ast {
        using boost::recursive_wrapper;
    
        using Id      = std::string;
        using Literal = std::string;
        using Enum    = std::vector<Id>;
    
        struct Base {
            Id      id;
            Literal literal;
        };
    
        struct Simple : Base {
            Enum enumeration;
        };
    
        struct Complex;
        struct Container;
    
        using Class = boost::variant<   
            Simple,                     
            recursive_wrapper<Complex>, 
            recursive_wrapper<Container>
        >;
    
        using Classes = std::vector<Class>;
        struct Container : Base { Class   element; };
        struct Complex   : Base { Classes members; };
    
        using Task = std::vector<Class>;
    } // namespace Ast
    

    So far so good. No surprises. The main thing is using recursive variants to allow nesting complex/container types. As a side note I reflected the common parts of all types as Base. Let's adapt these for use as Fusion sequences:

    BOOST_FUSION_ADAPT_STRUCT(Ast::Simple,    id, literal, enumeration)
    BOOST_FUSION_ADAPT_STRUCT(Ast::Complex,   id, literal, members)
    BOOST_FUSION_ADAPT_STRUCT(Ast::Container, id, literal, element)
    

    Now Spirit will know how to propagate attributes without further help.

  2. Grammar

    The skeleton is easy, just mapping AST nodes to rules:

    template <typename It> struct Task : qi::grammar<It, Ast::Task()> {
        Task() : Task::base_type(start) {
            start = skip(space)[task_];
            // ...
        }
    
      private:
        qi::rule<It, Ast::Task()> start;
    
        using Skipper = qi::space_type;
        qi::rule<It, Ast::Task(), Skipper>      task_;
        qi::rule<It, Ast::Class(), Skipper>     class_;
        qi::rule<It, Ast::Simple(), Skipper>    simple_;
        qi::rule<It, Ast::Complex(), Skipper>   complex_;
        qi::rule<It, Ast::Container(), Skipper> container_;
    
        // lexemes:
        qi::rule<It, Ast::Id()>      id_;
        qi::rule<It, Ast::Literal()> literal_;
    };
    

    Note I grouped the lexemes (that do not allow a skipper) and encapsulated the space skipper into the start rule.

    Because "classes" can appear explicitly, but also without the leading Class keyword, I will introduce an extra rule type_ so we can say:

        task_  = *class_ > eoi;
        type_  = simple_ | complex_ | container_;
        class_ = "Class" > type_ > ';';
    

    And also use type_ where Simple/Complex/Container is acceptable.

    For the rest, there aren't many surprises, so let's show the whole constructor block:

    Task() : Task::base_type(start) {
        using namespace qi;
    
        start = skip(space)[task_];
    
        // lexemes:
        id_      = raw[alpha >> *('_' | alnum)];
        literal_ = '"' > *('\\' >> char_ | ~char_('"')) > '"';
    
        auto optlit = copy(literal_ | attr(std::string(" "))); // weird, but okay
    
        task_      = *class_ > eoi;
        type_      = simple_ | complex_ | container_;
        class_     = lit("Class") > type_ > ';';
        simple_    = lit("Simple") >> id_ >> optlit >> enum_;
        complex_   = lit("Complex") >> id_ >> optlit >> '(' >> *type_ >> ')';
        container_ = lit("Container") >> id_ >> optlit >> '(' >> type_ > ')';
        enum_      = -(lit("enumeration") >> '(' >> (id_ % ',') > ')');
    
        BOOST_SPIRIT_DEBUG_NODES(
            (task_)(class_)(type_)(simple_)(complex_)(container_)(enum_)(id_)(literal_))
    }
    

    Note the other "extra" (enum_). Of course, I could have kept it all in the simple_ rule instead.

    Here's a Live Demo printing the raw AST for the sample input:

     - (caption " " {})
     - (columns "Column Name" {})
     - (CONTAINER_NAME " " (OBJECT_NAME " " {(obj_id " " {}), (obj_property1 " " {}), (obj_attribute " " {EnumOption1, EnumOption2, EnumOption3, EnumOption4}), (OBJECT_ITEMS " " (OBJECT_ITEM " " {(obj_item_name " " {}), (set_value " " (obj_item_value " " {}))}))}))
    

    It's just a shame that all my pretty error handling code is not firing :) The output is obviously pretty ugly, so let's fix that.
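As an aside on that error handling: reporting an expectation failure usefully comes down to turning a character offset into a line number, a column, and the text of the offending line. Below is a minimal, Spirit-free sketch of that step. The names `ErrorLocation`, `locate` and `caret_message` are hypothetical helpers made up for illustration, not part of Boost.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper type (not part of Spirit): where a failure happened.
struct ErrorLocation {
    std::size_t      line;   // 1-based line number
    std::size_t      column; // 0-based offset within that line
    std::string_view text;   // the offending line, without its line break
};

// Maps a character offset into `input` to line/column information.
ErrorLocation locate(std::string_view input, std::size_t offset) {
    assert(offset <= input.size());

    // Start of line: one past the last line break before `offset` (0 if none).
    std::size_t bol = 0;
    if (offset > 0) {
        auto prev = input.find_last_of("\r\n", offset - 1);
        if (prev != std::string_view::npos) bol = prev + 1;
    }

    // End of line: the next line break at or after `offset`, or end of input.
    auto eol = input.find_first_of("\r\n", offset);
    if (eol == std::string_view::npos) eol = input.size();

    // Every '\n' before the start of the line adds one to the line number.
    auto line = 1 + std::count(input.begin(), input.begin() + bol, '\n');
    return {static_cast<std::size_t>(line), offset - bol, input.substr(bol, eol - bol)};
}

// Renders the offending line with a caret under the failing column.
std::string caret_message(ErrorLocation const& loc) {
    return std::string(loc.text) + "\n" + std::string(loc.column, ' ') + "^--- here";
}
```

In a handler for `qi::expectation_failure<It>`, the offset would be the distance from the start of the input to `ef.first`; the rest is plain string handling.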

Generating XML


I'm not a Microsoft fan, and prefer other libraries for XML anyways (see What XML parser should I use in C++?).

So I'll choose PugiXML here.

  1. Generator

    Simply put, we have to teach the computer how to convert any Ast node into XML:

    #include <pugixml.hpp>
    namespace Generate {
        using namespace Ast;
    
        struct XML {
            using Node = pugi::xml_node;
    
            // callable for variant visiting:
            template <typename T> void operator()(Node parent, T const& node) const { apply(parent, node); }
    
          private:
            void apply(Node parent, Ast::Class const& c) const {
                using std::placeholders::_1;
                boost::apply_visitor(std::bind(*this, parent, _1), c);
            }
    
            // Id and Literal are both std::string, so the element name is passed explicitly
            void apply(Node parent, std::string const& s, char const* kind) const {
                named_child(parent, kind).text().set(s.c_str());
            }
    
            void apply(Node parent, Simple const& s) const {
                auto simple = named_child(parent, "simple");
                apply(simple, s.id, "identifier");
                apply(simple, s.literal, "literal");
                apply(simple, s.enumeration);
            }
    
            void apply(Node parent, Enum const& e) const {
                if (!e.empty()) {
                    auto enum_ = named_child(parent, "enumeration");
                    for (auto& v : e)
                        named_child(enum_, "word").text().set(v.c_str());
                }
            }
    
            void apply(Node parent, Complex const& c) const {
                auto complex_ = named_child(parent, "complex");
                apply(complex_, c.id, "identifier");
                apply(complex_, c.literal, "literal");
                for (auto& m : c.members)
                    apply(complex_, m);
            }
    
            void apply(Node parent, Container const& c) const {
                auto cont = named_child(parent, "container");
                apply(cont, c.id, "identifier");
                apply(cont, c.literal, "literal");
                apply(cont, c.element);
            }
    
            void apply(Node parent, Task const& t) const {
                auto task = named_child(parent, "task");
                for (auto& c : t)
                    apply(task.append_child("class"), c); // each top-level item gets a <class> wrapper
            }
    
          private:
            Node named_child(Node parent, std::string const& name) const {
                auto child = parent.append_child();
                child.set_name(name.c_str());
                return child;
            }
        };
    } // namespace Generate
    

    I'm not gonna say I typed this up error-free in a jiffy, but you'll recognize the pattern: It's following the Ast 1:1 to great success.
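To show the same "one overload per AST node" pattern without any library dependencies, here is a minimal sketch that swaps in `std::variant`/`std::visit` (C++17) for `boost::variant` and writes the XML text by hand instead of using PugiXML. Everything in it is an assumption made for illustration: the trimmed-down `Simple`/`Complex`/`Container` types (literals and enumerations are left out), the `XmlWriter` class, and the `to_xml` helper. Unlike PugiXML, it does no escaping of special characters.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <variant>
#include <vector>

// Simplified stand-ins for the Ast types. shared_ptr plays the role of
// boost::recursive_wrapper, allowing the variant to refer to incomplete types.
struct Simple { std::string id; };
struct Complex;
struct Container;
using Class = std::variant<Simple, std::shared_ptr<Complex>, std::shared_ptr<Container>>;
struct Complex   { std::string id; std::vector<Class> members; };
struct Container { std::string id; Class element; };

class XmlWriter {
  public:
    explicit XmlWriter(std::ostream& os) : os_(os) {}

    // Entry point: dispatch on whichever alternative the variant holds.
    void write(Class const& c, int depth) {
        std::visit([&](auto const& node) { emit(node, depth); }, c);
    }

  private:
    // One overload per node type, mirroring the AST 1:1.
    void emit(Simple const& s, int depth) {
        open("simple", depth);
        identifier(s.id, depth + 1);
        close("simple", depth);
    }
    void emit(std::shared_ptr<Complex> const& c, int depth) {
        assert(c); // a null node would indicate a malformed tree
        open("complex", depth);
        identifier(c->id, depth + 1);
        for (auto const& m : c->members) write(m, depth + 1);
        close("complex", depth);
    }
    void emit(std::shared_ptr<Container> const& c, int depth) {
        assert(c);
        open("container", depth);
        identifier(c->id, depth + 1);
        write(c->element, depth + 1);
        close("container", depth);
    }

    // Two spaces of indentation per nesting level.
    void indent(int depth) { os_ << std::string(2 * depth, ' '); }
    void open(char const* tag, int depth)  { indent(depth); os_ << '<' << tag << ">\n"; }
    void close(char const* tag, int depth) { indent(depth); os_ << "</" << tag << ">\n"; }
    void identifier(std::string const& id, int depth) {
        indent(depth);
        os_ << "<identifier>" << id << "</identifier>\n";
    }

    std::ostream& os_;
};

// Convenience wrapper returning the XML for one top-level node as a string.
std::string to_xml(Class const& c) {
    std::ostringstream os;
    XmlWriter{os}.write(c, 0);
    return os.str();
}
```

The recursion has the same shape as in the PugiXML generator: containers recurse into their single element, complexes loop over their members, and simples are leaves.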

FULL DEMO


Integrating all the above, and printing the XML output:

Live On Compiler Explorer

// #define BOOST_SPIRIT_DEBUG 1
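#include <boost/fusion/adapted.hpp>
#include <boost/spirit/include/qi.hpp>
#include <algorithm>   // std::count
#include <functional>  // std::bind, std::placeholders
#include <iomanip>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>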
namespace qi = boost::spirit::qi;

namespace Ast {
    using boost::recursive_wrapper;

    using Id      = std::string;
    using Literal = std::string;
    using Enum    = std::vector<Id>;

    struct Base {
        Id      id;
        Literal literal;
    };

    struct Simple : Base {
        Enum enumeration;
    };

    struct Complex;
    struct Container;

    using Class = boost::variant<    //
        Simple,                      //
        recursive_wrapper<Complex>,  //
        recursive_wrapper<Container> //
    >;

    using Classes = std::vector<Class>;
    struct Container : Base { Class   element; };
    struct Complex   : Base { Classes members; };

    using Task = std::vector<Class>;
} // namespace Ast

BOOST_FUSION_ADAPT_STRUCT(Ast::Simple,    id, literal, enumeration)
BOOST_FUSION_ADAPT_STRUCT(Ast::Complex,   id, literal, members)
BOOST_FUSION_ADAPT_STRUCT(Ast::Container, id, literal, element)

namespace Parser {
    template <typename It> struct Task : qi::grammar<It, Ast::Task()> {
        Task() : Task::base_type(start) {
            using namespace qi;

            start = skip(space)[task_];

            // lexemes:
            id_      = raw[alpha >> *('_' | alnum)];
            literal_ = '"' > *('\\' >> char_ | ~char_('"')) > '"';

            auto optlit = copy(literal_ | attr(std::string(" "))); // weird, but okay

            task_      = *class_ > eoi;
            type_      = simple_ | complex_ | container_;
            class_     = lit("Class") > type_ > ';';
            simple_    = lit("Simple") >> id_ >> optlit >> enum_;
            complex_   = lit("Complex") >> id_ >> optlit >> '(' >> *type_ >> ')';
            container_ = lit("Container") >> id_ >> optlit >> '(' >> type_ > ')';
            enum_      = -(lit("enumeration") >> '(' >> (id_ % ',') > ')');

            BOOST_SPIRIT_DEBUG_NODES(
                (task_)(class_)(type_)(simple_)(complex_)(container_)(enum_)(id_)(literal_))
        }

      private:
        qi::rule<It, Ast::Task()> start;

        using Skipper = qi::space_type;
        qi::rule<It, Ast::Task(), Skipper>      task_;
        qi::rule<It, Ast::Class(), Skipper>     class_, type_;
        qi::rule<It, Ast::Simple(), Skipper>    simple_;
        qi::rule<It, Ast::Complex(), Skipper>   complex_;
        qi::rule<It, Ast::Container(), Skipper> container_;
        qi::rule<It, Ast::Enum(), Skipper>      enum_;

        // lexemes:
        qi::rule<It, Ast::Id()>      id_;
        qi::rule<It, Ast::Literal()> literal_;
    };
}

#include <pugixml.hpp>
namespace Generate {
    using namespace Ast;

    struct XML {
        using Node = pugi::xml_node;

        // callable for variant visiting:
        template <typename T> void operator()(Node parent, T const& node) const { apply(parent, node); }

      private:
        void apply(Node parent, Ast::Class const& c) const {
            using std::placeholders::_1;
            boost::apply_visitor(std::bind(*this, parent, _1), c);
        }

        void apply(Node parent, std::string const& s, char const* kind) const {
            named_child(parent, kind).text().set(s.c_str());
        }

        void apply(Node parent, Simple const& s) const {
            auto simple = named_child(parent, "simple");
            apply(simple, s.id, "identifier");
            apply(simple, s.literal, "literal");
            apply(simple, s.enumeration);
        }

        void apply(Node parent, Enum const& e) const {
            if (!e.empty()) {
                auto enum_ = named_child(parent, "enumeration");
                for (auto& v : e)
                    named_child(enum_, "word").text().set(v.c_str());
            }
        }

        void apply(Node parent, Complex const& c) const {
            auto complex_ = named_child(parent, "complex");
            apply(complex_, c.id, "identifier");
            apply(complex_, c.literal, "literal");
            for (auto& m : c.members)
                apply(complex_, m);
        }

        void apply(Node parent, Container const& c) const {
            auto cont = named_child(parent, "container");
            apply(cont, c.id, "identifier");
            apply(cont, c.literal, "literal");
            apply(cont, c.element);
        }

        void apply(Node parent, Task const& t) const {
            auto task = named_child(parent, "task");
            for (auto& c : t)
                apply(task.append_child("class"), c);
        }

      private:
        Node named_child(Node parent, std::string const& name) const {
            auto child = parent.append_child();
            child.set_name(name.c_str());
            return child;
        }
    };
} // namespace Generate

int main() { 
    using It = std::string_view::const_iterator;
    static const Parser::Task<It> p;
    static const Generate::XML to_xml;

    for (std::string_view input :
         {
             R"(Class Simple caption;
                Class Simple columns "Column Name";

                Class Container CONTAINER_NAME ( 
                  Complex OBJECT_NAME ( 
                    Simple obj_id 
                    Simple obj_property1
                    Simple obj_attribute enumeration(EnumOption1, EnumOption2,EnumOption3,EnumOption4)
                    Container OBJECT_ITEMS (
                      Complex OBJECT_ITEM (
                        Simple obj_item_name
                        Container set_value (
                          Simple obj_item_value
                        )
                      )
                    )
                  )
                );)",
         }) //
    {
        try {
            Ast::Task t;

            if (qi::parse(begin(input), end(input), p, t)) {
                pugi::xml_document doc;
                to_xml(doc.root(), t);
                doc.print(std::cout, "  ", pugi::format_default);
                std::cout << std::endl;
            } else {
                std::cout << " -> INVALID" << std::endl;
            }
        } catch (qi::expectation_failure<It> const& ef) {
            auto f    = begin(input);
            auto p    = ef.first - input.begin();
            auto bol  = input.find_last_of("\r\n", p) + 1;
            auto line = std::count(f, f + bol, '\n') + 1;
            auto eol  = input.find_first_of("\r\n", p);

            std::cerr << " -> EXPECTED " << ef.what_ << " in line:" << line << "\n"
                << input.substr(bol, eol - bol) << "\n"
                << std::setw(p - bol) << ""
                << "^--- here" << std::endl;
        }
    }
}

Printing the coveted output:

<task>
  <class>
    <simple>
      <identifier>caption</identifier>
      <literal> </literal>
    </simple>
  </class>
  <class>
    <simple>
      <identifier>columns</identifier>
      <literal>Column Name</literal>
    </simple>
  </class>
  <class>
    <container>
      <identifier>CONTAINER_NAME</identifier>
      <literal> </literal>
      <complex>
        <identifier>OBJECT_NAME</identifier>
        <literal> </literal>
        <simple>
          <identifier>obj_id</identifier>
          <literal> </literal>
        </simple>
        <simple>
          <identifier>obj_property1</identifier>
          <literal> </literal>
        </simple>
        <simple>
          <identifier>obj_attribute</identifier>
          <literal> </literal>
          <enumeration>
            <word>EnumOption1</word>
            <word>EnumOption2</word>
            <word>EnumOption3</word>
            <word>EnumOption4</word>
          </enumeration>
        </simple>
        <container>
          <identifier>OBJECT_ITEMS</identifier>
          <literal> </literal>
          <complex>
            <identifier>OBJECT_ITEM</identifier>
            <literal> </literal>
            <simple>
              <identifier>obj_item_name</identifier>
              <literal> </literal>
            </simple>
            <container>
              <identifier>set_value</identifier>
              <literal> </literal>
              <simple>
                <identifier>obj_item_value</identifier>
                <literal> </literal>
              </simple>
            </container>
          </complex>
        </container>
      </complex>
    </container>
  </class>
</task>

I still don't understand how the CONTAINER_NAME: "namespacing" works, so I'll leave that to you to get right.

sehe
  • Dropping some of the ballast only added for raw AST output, no longer needed in the XML-ified version: https://compiler-explorer.com/z/vhzjW8Wb1 – sehe Jun 01 '23 at 21:18
  • Wow, with a detailed explanation of how the whole thing came about. You are truly a Boost expert. I will read more tonight to study these techniques. Thanks so much for this awesome lesson – Dylan Jun 02 '23 at 01:08
  • Hi Sehe, the output XML is missing the tag. Should be ``` caption " " ``` I'll try to fill that in with my limited knowledge. Hopefully I can get it to work. Thanks again. – Dylan Jun 02 '23 at 16:55
  • @Dylan No doubt you figured something out. In case you didn't, here goes: https://compiler-explorer.com/z/9a17K4Gdf – sehe Jun 03 '23 at 01:38

Thanks again for this great lesson. To answer your question about the CONTAINER_NAME: namespace: it is simply for grouping (not my rule; the folks who came up with the definition structure want it that way).

So if we parse this line

Class Simple caption;

then the outcome should be:

<task>
  <class>
    <simple>
      <identifier>caption:caption</identifier>
      <literal>" "</literal>
    </simple>
  </class>
</task>

The caption: namespace is added because this is the first child of this class. But if we parse:

Class Container CONTAINER_NAME ( 
  Complex OBJECT_NAME ( 
    Simple obj_id 
    Container OBJECT_ITEMS (
      Complex OBJECT_ITEM (
        Simple obj_item_name
        Container set_value (
          Simple obj_item_value
        )
      )
    )
  )
);

then the CONTAINER_NAME: namespace is prepended to the identifiers of all children:

  <class>
    <container>
      <identifier>CONTAINER_NAME:CONTAINER_NAME</identifier>
      <literal> </literal>
      <complex>
        <identifier>CONTAINER_NAME:OBJECT_NAME</identifier>
        <literal> </literal>
        <simple>
          <identifier>CONTAINER_NAME:obj_id</identifier>
          <literal> </literal>
        </simple>
        <container>
          <identifier>CONTAINER_NAME:OBJECT_ITEMS</identifier>
          <literal> </literal>
          <complex>
            <identifier>CONTAINER_NAME:OBJECT_ITEM</identifier>
            <literal> </literal>
            <simple>
              <identifier>CONTAINER_NAME:obj_item_name</identifier>
              <literal> </literal>
            </simple>
            <container>
              <identifier>CONTAINER_NAME:set_value</identifier>
              <literal> </literal>
              <simple>
                <identifier>CONTAINER_NAME:obj_item_value</identifier>
                <literal> </literal>
              </simple>
            </container>
          </complex>
        </container>
      </complex>
    </container>
  </class>

I added the following function to the XML struct to handle the namespace. It does the job, but I'm pretty sure you could come up with a one-liner for this... :)

std::string get_namespace(Node parent, std::string const& ident) const {
  auto parent_name = std::string(parent.name());
  // Default: a direct child of <class> uses its own identifier as the namespace.
  std::string ns = ident + ":" + ident;
  if (parent_name != "class") {
    // Nested node: reuse the namespace prefix (up to and including the
    // colon) from the identifier of the enclosing node.
    std::string parent_id = parent.child("identifier").text().as_string();
    ns = parent_id.substr(0, parent_id.find(":") + 1) + ident;
  }
  return ns;
}

Then I just call this function when generating the XML for Simple, Complex and Container:

void apply(Node parent, Simple const& s) const {
  auto simple = named_child(parent, "simple");
  apply(simple, get_namespace(parent, s.id), "identifier");
  apply(simple, s.literal, "literal");
  apply(simple, s.enumeration);
}
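An alternative to reading the prefix back out of the parent's XML node would be to fix the namespace once at the top-level class and pass it down the recursion. The sketch below only illustrates that idea on a hypothetical, simplified tree; `Item`, `qualify` and `qualified_ids` are made-up names, and it is not what the generator above actually does.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified tree: just an identifier plus nested children.
struct Item {
    std::string       id;
    std::vector<Item> children;
};

// Collects "<namespace>:<id>" for `item` and all of its descendants, in
// document order. The namespace never changes while descending.
void qualify(Item const& item, std::string const& ns, std::vector<std::string>& out) {
    assert(!ns.empty());
    out.push_back(ns + ":" + item.id);
    for (auto const& child : item.children)
        qualify(child, ns, out);
}

// Entry point for one top-level class: its own id becomes the namespace.
std::vector<std::string> qualified_ids(Item const& top_level) {
    std::vector<std::string> out;
    qualify(top_level, top_level.id, out);
    return out;
}
```

In the real generator this would presumably mean giving the `apply` overloads an extra namespace parameter, which avoids depending on the order in which the XML children were appended.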

Anyway, there's a lot more for me to do, as I also need to parse if-else and case statements, but this gives me a great starting point. Again, thanks for taking the time to share your knowledge with me.

Dylan