How to make a boost::spirit parser and lexer able to deal with include files

Question

Below is a do-nothing lexer and parser: it simply returns the string it read. I would like to extend it so that it can deal with a C++-like include statement. I can imagine how to do this, but I would like to know whether there is an easier or already available way. If I had to do it myself, I would implement my own iterator (to be passed to the lexer). This iterator would contain:

an index into a string (potentially using -1 to indicate end() iterator)
a pointer to this string

On encountering an include statement, the lexer would insert the file's contents into the string at the current position, overwriting the include statement. How would you do this?
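The iterator idea above can be sketched without Spirit. One reason to store an index plus a pointer rather than a std::string::iterator is that inserting into a std::string may reallocate it and invalidate iterators, while an index stays meaningful. The names `Cursor` and `splice` are hypothetical and used only for illustration; the end position is detected by comparing against the current size, not by the -1 sentinel mentioned above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper type (not part of Boost.Spirit): a position stored as
// "pointer to buffer + index" so that it survives reallocation of the buffer.
struct Cursor {
    std::string* buffer;
    std::size_t  pos;

    bool at_end() const { return pos >= buffer->size(); }
    char current() const { return (*buffer)[pos]; }
    void advance() { ++pos; }
};

// Replace `length` characters starting at the cursor with `text`.
// Afterwards the cursor refers to the first inserted character, so the
// inserted text is what gets read next.
inline void splice(Cursor& c, std::size_t length, const std::string& text) {
    c.buffer->replace(c.pos, length, text);
}
```

After `splice`, reading continues at the first character of the inserted text, which is the behaviour described above for the include statement.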

Here is my do-nothing lexer/parser:

#include <boost/phoenix.hpp>
#include <boost/bind.hpp>
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>
#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

namespace lex     = boost::spirit::lex;
namespace qi      = boost::spirit::qi;
namespace phoenix = boost::phoenix;


template<typename Lexer>
class lexer:public lex::lexer<Lexer>
{   public:
    typedef lex::token_def<char> char_token_type;
    char_token_type m_sChar;
    //lex::token_def<lex::omit> m_sInclude;
    lexer(void)
        : m_sChar(".")//,
        //m_sInclude("^#include \"[^\"]*\"")
    {   this->self += m_sChar;
    }
};

template<typename Iterator>
class grammar : public qi::grammar<Iterator, std::string()>
{   public:
    qi::rule<Iterator, std::string()> m_sStart;
    template<typename Tokens>
    explicit grammar(Tokens const& tokens)
        : grammar::base_type(m_sStart)
    {   m_sStart %= *tokens.m_sChar >> qi::eoi;
    }
};


int main(int, char**)
{
    typedef lex::lexertl::token<std::string::const_iterator, boost::mpl::vector<char> > token_type;
    typedef lexer<lex::lexertl::actor_lexer<token_type> > expression_lexer_type;
    typedef expression_lexer_type::iterator_type expression_lexer_iterator_type;
    typedef grammar<expression_lexer_iterator_type> expression_grammar_type;

    expression_lexer_type lexer;
    expression_grammar_type grammar(lexer);
    const std::string s_ac = "this is a test\n\
#include \"test.dat\"\n\
";
    std::string s;
    auto pBegin = std::begin(s_ac);
    lex::tokenize_and_parse(pBegin, std::end(s_ac), lexer, grammar, s);
}

score 2 · Answer 1 · answered Nov 06 '17 at 22:12

Firstly, a preprocessor based on Spirit exists: Boost Wave (see also How do I implement include directives using boost::spirit::lex?)

Secondly, "inserting the contents of the include file into the string value" is both useless (for lexing purposes) and highly inefficient:

it's useless because the include file will form a single token (!?) which means your parser can't act on the included contents
it's not generic because nested includes are not going to happen this way
even if the goal is only to /copy/ the include file verbatim to an equivalent output stream, it's horrifically inefficient to do so by copying the contents fully into memory, copying it around through the lexer into a parser, only to stream it out. You could just siphon the input stream into the output stream with minimal allocations instead.

I'd suggest any combination of the following:

separate concerns: don't conflate parsing with interpreting. So, if you're gonna parse include directives, you'll return a representation of the include statements that can then be passed to code that interprets it
a special, stronger case of separation of concerns is to move the include-handling to a preprocessing stage. Indeed, a custom iterator type could do the trick, but I'd build the lexer on top of it, so the lexer doesn't have to know about includes, instead just lexing the source, without (having to) know the exact origin.
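Both suggestions can be sketched together in plain C++, with no Spirit involved. Everything here is an assumption made for illustration: `parse_include` only recognises a directive and reports the file name, `expand_into` and `preprocess` interpret it, and `FileMap` is a hypothetical in-memory stand-in for the file system. The result is a flat string that a lexer could consume without knowing where each part came from.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the file system: file name -> contents.
using FileMap = std::map<std::string, std::string>;

// Parsing step: recognise a line of the form  #include "name"  and report
// the name. It does not open or read anything.
inline bool parse_include(const std::string& line, std::string& name) {
    const std::string prefix = "#include \"";
    if (line.size() <= prefix.size()
        || line.compare(0, prefix.size(), prefix) != 0
        || line.back() != '"')
        return false;
    name = line.substr(prefix.size(), line.size() - prefix.size() - 1);
    return true;
}

// Interpretation step: append the contents of `name` to `out`, expanding
// nested includes recursively. `active` holds the files currently being
// expanded so that circular includes are reported instead of looping.
inline void expand_into(const std::string& name, const FileMap& files,
                        std::set<std::string>& active, std::string& out) {
    const auto it = files.find(name);
    if (it == files.end())
        throw std::runtime_error("cannot open " + name);
    if (!active.insert(name).second)
        throw std::runtime_error("circular include of " + name);

    std::istringstream in(it->second);
    std::string line, included;
    while (std::getline(in, line)) {
        if (parse_include(line, included)) {
            expand_into(included, files, active, out);
        } else {
            out += line;
            out += '\n';
        }
    }
    active.erase(name);
}

// Preprocessing stage: returns the fully expanded text of `root`.
inline std::string preprocess(const std::string& root, const FileMap& files) {
    std::set<std::string> active;
    std::string out;
    expand_into(root, files, active, out);
    return out;
}
```

The string returned by `preprocess` could then be handed to `lex::tokenize_and_parse` exactly like the literal in the question, so the lexer and grammar stay unchanged.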

Performance is not yet my concern! Understanding is! Wave is much too complicated. This is the reason for stackoverflow.com — , Nov 06 '17 at 22:27
Perhaps you should have made your constraints and prior research clear. Also, I cannot help but notice the enormous clash between "Wave is much too complicated" and the subsequent move to start using Spirit with a Lexer - this is hardly the area where Spirit shines, and Spirit Qi has a steep learning curve even without the complications of Lex. Just a fair warning. — sehe, Nov 06 '17 at 22:30

score 1 · Answer 2 · answered Nov 06 '17 at 23:14

The code below replaces the include statement with "abcd", which stands in for the contents of the file:

#include <boost/phoenix.hpp>
#include <boost/bind.hpp>
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/phoenix_core.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/phoenix/object.hpp>
#include <boost/spirit/include/qi_char_class.hpp>
#include <boost/spirit/include/phoenix_bind.hpp>
#include <boost/mpl/index_of.hpp>

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>
#include <iterator>


namespace lex     = boost::spirit::lex;
namespace qi      = boost::spirit::qi;
namespace phoenix = boost::phoenix;

struct myIterator:std::iterator<std::random_access_iterator_tag, char>
{   std::string *m_p;
    std::size_t m_iPos;
    myIterator(void)
        :m_p(nullptr),
        m_iPos(~std::size_t(0))
    {
    }
    myIterator(std::string &_r, const bool _bEnd = false)
        :m_p(&_r),
        m_iPos(_bEnd ? ~std::size_t(0) : 0)
    {
    }
    myIterator(const myIterator &_r)
        :m_p(_r.m_p),
        m_iPos(_r.m_iPos)
    {
    }
    myIterator &operator=(const myIterator &_r)
    {   if (this != &_r)
        {   m_p = _r.m_p;
            m_iPos = _r.m_iPos;
        }
        return *this;
    }
    const char &operator*(void) const
    {   return m_p->at(m_iPos);
    }
    bool operator==(const myIterator &_r) const
    {   return m_p == _r.m_p && m_iPos == _r.m_iPos;
    }
    bool operator!=(const myIterator &_r) const
    {   return m_p != _r.m_p || m_iPos != _r.m_iPos;
    }
    myIterator &operator++(void)
    {   ++m_iPos;
        if (m_iPos == m_p->size())
            m_iPos = ~std::size_t(0);
        return *this;
    }
    myIterator operator++(int)
    {   const myIterator s(*this);
        operator++();
        return s;
    }
};
struct include
{   auto operator()(myIterator &_rStart, myIterator &_rEnd) const
    {       // erase what has been matched (the include statement)
        _rStart.m_p->erase(_rStart.m_iPos, _rEnd.m_iPos - _rStart.m_iPos);
            // and insert the contents of the file
        _rStart.m_p->insert(_rStart.m_iPos, "abcd");
        _rEnd = _rStart;
        return lex::pass_flags::pass_ignore;
//lex::_pass = lex::pass_flags::pass_ignore
    }
};
template<typename Lexer>
class lexer:public lex::lexer<Lexer>
{   public:
    typedef lex::token_def<char> char_token_type;
    char_token_type m_sChar;
    lex::token_def<lex::omit> m_sInclude;
    lexer(void)
        : m_sChar("."),
        m_sInclude("#include [\"][^\"]*[\"]")
    {   this->self += m_sInclude[lex::_pass = boost::phoenix::bind(include(), lex::_start, lex::_end)]
            | m_sChar;
    }
};

template<typename Iterator>
class grammar : public qi::grammar<Iterator, std::string()>
{   public:
    qi::rule<Iterator, std::string()> m_sStart;
    template<typename Tokens>
    explicit grammar(Tokens const& tokens)
        : grammar::base_type(m_sStart)
    {   m_sStart %= *tokens.m_sChar >> qi::eoi;
    }
};


int main(int, char**)
{
    typedef lex::lexertl::token<myIterator, boost::mpl::vector<char> > token_type;
    typedef lexer<lex::lexertl::actor_lexer<token_type> > expression_lexer_type;
    typedef expression_lexer_type::iterator_type expression_lexer_iterator_type;
    typedef grammar<expression_lexer_iterator_type> expression_grammar_type;

    expression_lexer_type lexer;
    expression_grammar_type grammar(lexer);
    std::string s_ac = "this is a test\n\
#include \"test.dat\"\n\
";
    std::string s;
    myIterator pBegin(s_ac);
    lex::tokenize_and_parse(pBegin, myIterator(s_ac, true), lexer, grammar, s);
}
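To go from the "abcd" placeholder to real file contents, the matched text has to be turned into a file name and the file has to be read. The following is a minimal sketch using only the standard library; `include_path`, `read_file` and `replace_include` are hypothetical helpers, not part of Boost.Spirit, and `start` and `end` are assumed to be plain indices into the buffer (roughly what `_rStart.m_iPos` and `_rEnd.m_iPos` provide above, as long as `_rEnd` is not the end sentinel).

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Given the matched text, e.g.  #include "test.dat" , return  test.dat .
inline std::string include_path(const std::string& matched) {
    const auto first = matched.find('"');
    const auto last  = matched.rfind('"');
    if (first == std::string::npos || first == last)
        throw std::runtime_error("malformed include: " + matched);
    return matched.substr(first + 1, last - first - 1);
}

// Read a whole file into a string; throws if the file cannot be opened.
inline std::string read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Replace the include statement occupying [start, end) in `buffer` with the
// contents of the file it names.
inline void replace_include(std::string& buffer, std::size_t start,
                            std::size_t end) {
    const std::string path = include_path(buffer.substr(start, end - start));
    buffer.replace(start, end - start, read_file(path));
}
```

Because the inserted text is lexed again from the start position, an include statement inside the included file would be expanded in turn; this sketch has no guard against a file that includes itself.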

This is actually not bad. I'm a bit worried how it will work out when dealing with [multi_pass](http://www.boost.org/doc/libs/1_65_1/libs/spirit/doc/html/spirit/support/multi_pass.html) iterators, like you would with filestreams, but I didn't manage that in the allotted time when I tried either. This being simple as it is has some elegance. +1 — sehe, Nov 09 '17 at 00:38
