11

Lets say I have code like this (line numbers for reference):

1:
2:function FuncName_1 {
3:    var Var_1 = 3;
4:    var  Var_2 = 4;
5:    ...

I want to write a grammar that parses such text, puts all indentifiers (function and variable names) infos into a tree (utree?). Each node should preserve: line_num, column_num and symbol value. example:

root: FuncName_1 (line:2,col:10)
  children[0]: Var_1 (line:3, col:8)
  children[1]: Var_1 (line:4, col:9)

I want to put it into the tree because I plan to traverse through that tree and for each node I must know the 'context': (all parent nodes of current nodes).

E.g, while processing node with Var_1, I must know that this is a name for local variable for function FuncName_1 (that is currently being processed as node, but one level earlier)

I cannot figure out few things

  1. Can this be done in Spirit with semantic actions and utree's ? Or should I use variant<> trees ?
  2. How to pass to the node those three informations (column,line,symbol_name) at the same time ? I know I must use pos_iterator as iterator type for grammar but how to access those information in sematic action ?

I'm a newbie in Boost so I read the Spirit documentaiton over and over, I try to google my problems but I somehow cannot put all the pieces together ot find the solution. Seems like there was no one me with such use case like mine before (or I'm just not able to find it) Looks like the only solutions with position iterator are the ones with parsing error handling, but this is not the case I'm interested in. The code that only parses the code I was taking about is below but I dont know how to move forward with it.

  #include <boost/spirit/include/qi.hpp>
  #include <boost/spirit/include/support_line_pos_iterator.hpp>

  namespace qi = boost::spirit::qi;
  typedef boost::spirit::line_pos_iterator<std::string::const_iterator> pos_iterator_t;

  template<typename Iterator=pos_iterator_t, typename Skipper=qi::space_type>
  struct ParseGrammar: public qi::grammar<Iterator, Skipper>
  {
        ParseGrammar():ParseGrammar::base_type(SourceCode)
        {
           using namespace qi;
           KeywordFunction = lit("function");
           KeywordVar    = lit("var");
           SemiColon     = lit(';');

           Identifier = lexeme [alpha >> *(alnum | '_')];
           VarAssignemnt = KeywordVar >> Identifier >> char_('=') >> int_ >> SemiColon;
           SourceCode = KeywordFunction >> Identifier >> '{' >> *VarAssignemnt >> '}';
        }

        qi::rule<Iterator, Skipper> SourceCode;
        qi::rule<Iterator > KeywordFunction;
        qi::rule<Iterator,  Skipper> VarAssignemnt;
        qi::rule<Iterator> KeywordVar;
        qi::rule<Iterator> SemiColon;
        qi::rule<Iterator > Identifier;
  };

  int main()
  {
     std::string const content = "function FuncName_1 {\n var Var_1 = 3;\n var  Var_2 = 4; }";

     pos_iterator_t first(content.begin()), iter = first, last(content.end());
     ParseGrammar<pos_iterator_t> resolver;    //  Our parser
     bool ok = phrase_parse(iter,
                            last,
                            resolver,
                            qi::space);

     std::cout << std::boolalpha;
     std::cout << "\nok : " << ok << std::endl;
     std::cout << "full   : " << (iter == last) << std::endl;
     if(ok && iter == last)
     {
        std::cout << "OK: Parsing fully succeeded\n\n";
     }
     else
     {
        int line   = get_line(iter);
        int column = get_column(first, iter);
        std::cout << "-------------------------\n";
        std::cout << "ERROR: Parsing failed or not complete\n";
        std::cout << "stopped at: " << line  << ":" << column << "\n";
        std::cout << "remaining: '" << std::string(iter, last) << "'\n";
        std::cout << "-------------------------\n";
     }
     return 0;
  }
sehe
  • 374,641
  • 47
  • 450
  • 633
  • Re.: `utree` is largely deemed a nice experiment, but no longer receives much support from the spirit devs: _["... The allure is that there's one and only one AST type to deal with, but that is also its weakness: a one-size-fits all approach is almost always not good." (more context on the \[spirit-general\] mailing list](http://boost.2283326.n4.nabble.com/Utree-tutorial-tp4652706p4652736.html))_ – sehe Oct 27 '13 at 16:39
  • Re.: The 'problem' of carrying around context during parser (e.g. for semantic analysis) my usual approach is to move that to a separate pass over the AST _after_ parsing (to prevent _encumbering_ the grammar definition with semantic actions). If you must, you can use `phx::ref()` and/or `qi::locals` (an [SO search](http://stackoverflow.com/search?tab=votes&q=qi%3a%3alocals) could turn up examples of use) – sehe Oct 27 '13 at 16:44

1 Answers1

15

This has been a fun exercise, where I finally put together a working demo of on_success[1] to annotate AST nodes.

Let's assume we want an AST like:

namespace ast
{
struct LocationInfo {
    unsigned line, column, length;
};

struct Identifier     : LocationInfo {
    std::string name;
};

struct VarAssignment  : LocationInfo {
    Identifier id;
    int value;
};

struct SourceCode     : LocationInfo {
    Identifier function;
    std::vector<VarAssignment> assignments;
};
}

I know, 'location information' is probably overkill for the SourceCode node, but you know... Anyways, to make it easy to assign attributes to these nodes without requiring semantic actions or lots of specifically crafted constructors:

#include <boost/fusion/adapted/struct.hpp>
BOOST_FUSION_ADAPT_STRUCT(ast::Identifier,    (std::string, name))
BOOST_FUSION_ADAPT_STRUCT(ast::VarAssignment, (ast::Identifier, id)(int, value))
BOOST_FUSION_ADAPT_STRUCT(ast::SourceCode,    (ast::Identifier, function)(std::vector<ast::VarAssignment>, assignments))

There. Now we can declare the rules to expose these attributes:

qi::rule<Iterator, ast::SourceCode(),    Skipper> SourceCode;
qi::rule<Iterator, ast::VarAssignment(), Skipper> VarAssignment;
qi::rule<Iterator, ast::Identifier()>         Identifier;
// no skipper, no attributes:
qi::rule<Iterator> KeywordFunction, KeywordVar, SemiColon;

We don't (essentially) modify the grammar, at all: attribute propagation is "just automatic"[2] :

KeywordFunction = lit("function");
KeywordVar      = lit("var");
SemiColon       = lit(';');

Identifier      = as_string [ alpha >> *(alnum | char_("_")) ];
VarAssignment   = KeywordVar >> Identifier >> '=' >> int_ >> SemiColon; 
SourceCode      = KeywordFunction >> Identifier >> '{' >> *VarAssignment >> '}';

The magic

How do we get the source location information attached to our nodes?

auto set_location_info = annotate(_val, _1, _3);
on_success(Identifier,    set_location_info);
on_success(VarAssignment, set_location_info);
on_success(SourceCode,    set_location_info);

Now, annotate is just a lazy version of a calleable that is defined as:

template<typename It>
struct annotation_f {
    typedef void result_type;

    annotation_f(It first) : first(first) {}
    It const first;

    template<typename Val, typename First, typename Last>
    void operator()(Val& v, First f, Last l) const {
        do_annotate(v, f, l, first);
    }
  private:
    void static do_annotate(ast::LocationInfo& li, It f, It l, It first) {
        using std::distance;
        li.line   = get_line(f);
        li.column = get_column(first, f);
        li.length = distance(f, l);
    }
    static void do_annotate(...) { }
};

Due to way in which get_column works, the functor is stateful (as it remembers the start iterator)[3]. As you can see do_annotate just accepts anything that derives from LocationInfo.

Now, the proof of the pudding:

std::string const content = "function FuncName_1 {\n var Var_1 = 3;\n var  Var_2 = 4; }";

pos_iterator_t first(content.begin()), iter = first, last(content.end());
ParseGrammar<pos_iterator_t> resolver(first);    //  Our parser

ast::SourceCode program;
bool ok = phrase_parse(iter,
        last,
        resolver,
        qi::space,
        program);

std::cout << std::boolalpha;
std::cout << "ok  : " << ok << std::endl;
std::cout << "full: " << (iter == last) << std::endl;
if(ok && iter == last)
{
    std::cout << "OK: Parsing fully succeeded\n\n";

    std::cout << "Function name: " << program.function.name << " (see L" << program.printLoc() << ")\n";
    for (auto const& va : program.assignments)
        std::cout << "variable " << va.id.name << " assigned value " << va.value << " at L" << va.printLoc() << "\n";
}
else
{
    int line   = get_line(iter);
    int column = get_column(first, iter);
    std::cout << "-------------------------\n";
    std::cout << "ERROR: Parsing failed or not complete\n";
    std::cout << "stopped at: " << line  << ":" << column << "\n";
    std::cout << "remaining: '" << std::string(iter, last) << "'\n";
    std::cout << "-------------------------\n";
}

This prints:

ok  : true
full: true
OK: Parsing fully succeeded

Function name: FuncName_1 (see L1:1:56)
variable Var_1 assigned value 3 at L2:3:14
variable Var_2 assigned value 4 at L3:3:15

Full Demo Program

See it Live On Coliru

Also showing:

  • error handling, e.g.:

    Error: expecting "=" in line 3: 
    
    var  Var_2 - 4; }
               ^---- here
    ok  : false
    full: false
    -------------------------
    ERROR: Parsing failed or not complete
    stopped at: 1:1
    remaining: 'function FuncName_1 {
    var Var_1 = 3;
    var  Var_2 - 4; }'
    -------------------------
    
  • BOOST_SPIRIT_DEBUG macros

  • A bit of a hacky way to conveniently stream the LocationInfo part of any AST node, sorry :)
//#define BOOST_SPIRIT_DEBUG
#define BOOST_SPIRIT_USE_PHOENIX_V3
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/include/support_line_pos_iterator.hpp>
#include <iomanip>

namespace qi = boost::spirit::qi;
namespace phx= boost::phoenix;

typedef boost::spirit::line_pos_iterator<std::string::const_iterator> pos_iterator_t;

namespace ast
{
    namespace manip { struct LocationInfoPrinter; }

    struct LocationInfo {
        unsigned line, column, length;
        manip::LocationInfoPrinter printLoc() const;
    };

    struct Identifier     : LocationInfo {
        std::string name;
    };

    struct VarAssignment  : LocationInfo {
        Identifier id;
        int value;
    };

    struct SourceCode     : LocationInfo {
        Identifier function;
        std::vector<VarAssignment> assignments;
    };

    ///////////////////////////////////////////////////////////////////////////
    // Completely unnecessary tweak to get a "poor man's" io manipulator going
    // so we can do `std::cout << x.printLoc()` on types of `x` deriving from
    // LocationInfo
    namespace manip {
        struct LocationInfoPrinter {
            LocationInfoPrinter(LocationInfo const& ref) : ref(ref) {}
            LocationInfo const& ref;
            friend std::ostream& operator<<(std::ostream& os, LocationInfoPrinter const& lip) {
                return os << lip.ref.line << ':' << lip.ref.column << ':' << lip.ref.length;
            }
        };
    }

    manip::LocationInfoPrinter LocationInfo::printLoc() const { return { *this }; }
    // feel free to disregard this hack
    ///////////////////////////////////////////////////////////////////////////
}

BOOST_FUSION_ADAPT_STRUCT(ast::Identifier,    (std::string, name))
BOOST_FUSION_ADAPT_STRUCT(ast::VarAssignment, (ast::Identifier, id)(int, value))
BOOST_FUSION_ADAPT_STRUCT(ast::SourceCode,    (ast::Identifier, function)(std::vector<ast::VarAssignment>, assignments))

struct error_handler_f {
    typedef qi::error_handler_result result_type;
    template<typename T1, typename T2, typename T3, typename T4>
        qi::error_handler_result operator()(T1 b, T2 e, T3 where, T4 const& what) const {
            std::cerr << "Error: expecting " << what << " in line " << get_line(where) << ": \n" 
                << std::string(b,e) << "\n"
                << std::setw(std::distance(b, where)) << '^' << "---- here\n";
            return qi::fail;
        }
};

template<typename It>
struct annotation_f {
    typedef void result_type;

    annotation_f(It first) : first(first) {}
    It const first;

    template<typename Val, typename First, typename Last>
    void operator()(Val& v, First f, Last l) const {
        do_annotate(v, f, l, first);
    }
  private:
    void static do_annotate(ast::LocationInfo& li, It f, It l, It first) {
        using std::distance;
        li.line   = get_line(f);
        li.column = get_column(first, f);
        li.length = distance(f, l);
    }
    static void do_annotate(...) {}
};

template<typename Iterator=pos_iterator_t, typename Skipper=qi::space_type>
struct ParseGrammar: public qi::grammar<Iterator, ast::SourceCode(), Skipper>
{
    ParseGrammar(Iterator first) : 
        ParseGrammar::base_type(SourceCode),
        annotate(first)
    {
        using namespace qi;
        KeywordFunction = lit("function");
        KeywordVar      = lit("var");
        SemiColon       = lit(';');

        Identifier      = as_string [ alpha >> *(alnum | char_("_")) ];
        VarAssignment   = KeywordVar > Identifier > '=' > int_ > SemiColon; // note: expectation points
        SourceCode      = KeywordFunction >> Identifier >> '{' >> *VarAssignment >> '}';

        on_error<fail>(VarAssignment, handler(_1, _2, _3, _4));
        on_error<fail>(SourceCode, handler(_1, _2, _3, _4));

        auto set_location_info = annotate(_val, _1, _3);
        on_success(Identifier,    set_location_info);
        on_success(VarAssignment, set_location_info);
        on_success(SourceCode,    set_location_info);

        BOOST_SPIRIT_DEBUG_NODES((KeywordFunction)(KeywordVar)(SemiColon)(Identifier)(VarAssignment)(SourceCode))
    }

    phx::function<error_handler_f> handler;
    phx::function<annotation_f<Iterator>> annotate;

    qi::rule<Iterator, ast::SourceCode(),    Skipper> SourceCode;
    qi::rule<Iterator, ast::VarAssignment(), Skipper> VarAssignment;
    qi::rule<Iterator, ast::Identifier()>             Identifier;
    // no skipper, no attributes:
    qi::rule<Iterator> KeywordFunction, KeywordVar, SemiColon;
};

int main()
{
    std::string const content = "function FuncName_1 {\n var Var_1 = 3;\n var  Var_2 - 4; }";

    pos_iterator_t first(content.begin()), iter = first, last(content.end());
    ParseGrammar<pos_iterator_t> resolver(first);    //  Our parser

    ast::SourceCode program;
    bool ok = phrase_parse(iter,
            last,
            resolver,
            qi::space,
            program);

    std::cout << std::boolalpha;
    std::cout << "ok  : " << ok << std::endl;
    std::cout << "full: " << (iter == last) << std::endl;
    if(ok && iter == last)
    {
        std::cout << "OK: Parsing fully succeeded\n\n";

        std::cout << "Function name: " << program.function.name << " (see L" << program.printLoc() << ")\n";
        for (auto const& va : program.assignments)
            std::cout << "variable " << va.id.name << " assigned value " << va.value << " at L" << va.printLoc() << "\n";
    }
    else
    {
        int line   = get_line(iter);
        int column = get_column(first, iter);
        std::cout << "-------------------------\n";
        std::cout << "ERROR: Parsing failed or not complete\n";
        std::cout << "stopped at: " << line  << ":" << column << "\n";
        std::cout << "remaining: '" << std::string(iter, last) << "'\n";
        std::cout << "-------------------------\n";
    }
    return 0;
}

[1] sadly un(der)documented, except for the conjure sample(s)

[2] well, I used as_string to get proper assignment to Identifier without too much work

[3] There could be smarter ways about this in terms of performance, but for now, let's keep it simple

sehe
  • 374,641
  • 47
  • 450
  • 633
  • @Gregory81 doing this without `on_success` is quite Okay, but in my opinion it doesn't "scale" well for larger grammars. You'd be looking at using _semantic actions_ (see e.g. [this answer](http://stackoverflow.com/a/19145704/85371), or the [`text_node_t` builder from this larger example](http://stackoverflow.com/questions/8358975/cross-platform-way-to-get-line-number-of-an-ini-file-where-given-option-was-foun/8365427#8365427)) ... – sehe Oct 27 '13 at 16:36
  • ... or look at the [`iter_pos` custom directive](http://stackoverflow.com/questions/15814328/how-do-i-capture-the-original-input-into-the-synthesized-output-from-a-spirit-gr/15818129#15818129) from the [spirit repository](http://www.boost.org/doc/libs/1_54_0/libs/spirit/repository/doc/html/index.html). – sehe Oct 27 '13 at 16:36
  • @Gregory81 I prefer to keep my "consulting" limited to stack overflow. This is because I'm convinced that it serves best for future users looking for similar information. Also, the system is open, and I'm frequently surprised with the excellent solutions that others come up with. Everybody wins. – sehe Oct 27 '13 at 16:46
  • [Deleted my previous comment with my email to not tempt spammers] @sehe, 1. Ok, I understand you position about private contact. 2. I've tried to compile the code you provided and It seems it uses few cxx11 features, my compiler (visual studio 2010) was not happy with some constructs. 3. There few things in your code that due to my lack of boost-spirt knowledge I don't understand. 3.1 What are 'expectation points' ? Could please you direct me to some doc/tutorial/info so I could famiarize with it ? – Grzegorz Wolszczak Oct 27 '13 at 17:16
  • continuation... 3.2 Could you please tell me , for the sake of excersices, how would line `VarAssignment = KeywordVar > Identifier > '=' > int_ > SemiColon;` looked like if I wanted semantic actions instead of on_success ? I understand that it would not 'scale' as you said, but semantic actions are much better documented that on_success. 3.2 Is it posible to access position iterator for currently matched rule/lexeme in semantic action and pass it to some function call ? – Grzegorz Wolszczak Oct 27 '13 at 17:32
  • @Gregory81 [This](http://coliru.stacked-crooked.com/a/5864737193002b37) is a version that compiles in clang ang g++ without -std=c++11, hopefully it will also work in your vs. You can find the documentation of the expectation parser [here](http://www.boost.org/libs/spirit/doc/html/spirit/qi/reference/operator/expect.html). It simply causes an exception if its left argument succeeds and the right one fails (if `var` is not followed by an identifier for example). `on_error` allows you to deal with the exception. – llonesmiz Oct 28 '13 at 21:46
  • Thanks for the leg-work there @cv_and_he. I was going to get back - just didn't know when :( – sehe Oct 28 '13 at 21:54
  • 1
    @sehe Thanks for this answer, it condenses perfectly the way to use `on_success` that is a little difficult to follow in the conjure example, due to the big number of files. – llonesmiz Oct 28 '13 at 22:02
  • I just used your approach into my own parser, to annotate my AST nodes with their spans. It's working well, thanks for your solution. There is a limitation, though: because ParseGrammar takes the iterator in its constructor, an instance of it can be used to parse only one input text. To work around it, my version of annotation_f takes a pointer to ParseGrammar in its constructor, and it is ParseGrammar that hosts the iterator as a member. The value of this member has to be set before parsing an input. – barjak Jul 14 '18 at 10:44