Boost Spirit parsing with no skipper

Question

Consider a preprocessor that reads raw text (no significant whitespace or tokens).

There are 3 rules.

resolve_para_entry should resolve a single argument inside a call. The top-level text is returned as a string.
resolve_para should resolve the whole parameter list and put all the top-level parameters in a string list.
resolve is the entry point.

Along the way I track the iterator position to extract the matched portion of the text.

Samples:

sometext(para) → expect para in the string list
sometext(para1,para2) → expect para1 and para2 in string list
sometext(call(a)) → expect call(a) in the string list
sometext(call(a,b)) ← here it fails; it seems that the "!lit(',')" won't let the parser step outside the nested call

Rules:

resolve_para_entry = +(  
     (iter_pos >> lit('(') >> (resolve_para_entry | eps) >> lit(')') >> iter_pos) [_val=  phoenix::bind(&appendString, _val, _1,_3)]
     | (!lit(',') >> !lit(')') >> !lit('(') >> (wide::char_ | wide::space))         [_val = phoenix::bind(&appendChar, _val, _1)]
    );

resolve_para = (lit('(') >> lit(')'))[_val = std::vector<std::wstring>()]  // empty para -> old style
    | (lit('(') >> resolve_para_entry >> *(lit(',') >> resolve_para_entry) > lit(')'))[_val = phoenix::bind(&appendStringList, _val, _1, _2)]
    | eps;
  ;

resolve = (iter_pos >> name_valid >> iter_pos >> resolve_para >> iter_pos);

In the end this doesn't seem very elegant. Maybe there is a better way to parse this kind of input without a skipper?

sehe · Accepted Answer · 2017-11-16T16:58:13.393

Indeed this should be a lot simpler.

First off, I fail to see why the absence of a skipper is at all relevant.

Second, exposing the raw input is best done using qi::raw[] instead of dancing with iter_pos and clumsy semantic actions¹.

Among the other observations I see:

negating a charset is done with ~, so e.g. ~char_(",()")
(p|eps) would be better spelled -p
(lit('(') >> lit(')')) could be just "()" (after all, there's no skipper, right)
p >> *(',' >> p) is equivalent to p % ','

With the above, resolve_para simplifies to this:

resolve_para = '(' >> -(resolve_para_entry % ',') >> ')';

resolve_para_entry seems weird to me. It appears that any nested parentheses are simply swallowed. Why not actually parse a recursive grammar so you detect syntax errors?

Here's my take on it:

Define An AST

I prefer to make this the first step because it helps me think about the parser productions:

namespace Ast {

    using ArgList = std::list<std::string>;

    struct Resolve {
        std::string name;
        ArgList arglist;
    };

    using Resolves = std::vector<Resolve>;
}

Creating The Grammar Rules

qi::rule<It, Ast::Resolves()> start;
qi::rule<It, Ast::Resolve()>  resolve;
qi::rule<It, Ast::ArgList()>  arglist;
qi::rule<It, std::string()>   arg, identifier;

And their definitions:

identifier = char_("a-zA-Z_") >> *char_("a-zA-Z0-9_");

arg        = raw [ +('(' >> -arg >> ')' | +~char_(",)(")) ];
arglist    = '(' >> -(arg % ',') >> ')';
resolve    = identifier >> arglist;

start      = *qr::seek[hold[resolve]];

Notes:

No more semantic actions
No more eps
No more iter_pos
I've opted to make arglist not-optional. If you really wanted that, change it back:
```
resolve    = identifier >> -arglist;
```
But in our sample it will generate a lot of noisy output.
Of course your entry point (start) will be different. I just did the simplest thing that could possibly work, using another handy parser directive from the Spirit Repository (like iter_pos that you were already using): seek[]
The hold[] is there for the reason explained in "boost::spirit::qi duplicate parsing on the output": without it, partial results from a failed match attempt can remain in the attribute. You might not need it in your actual parser.

Live On Coliru

#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/repository/include/qi_seek.hpp>
#include <list>
#include <string>
#include <vector>

namespace Ast {

    using ArgList = std::list<std::string>;

    struct Resolve {
        std::string name;
        ArgList arglist;
    };

    using Resolves = std::vector<Resolve>;
}

BOOST_FUSION_ADAPT_STRUCT(Ast::Resolve, name, arglist)

namespace qi = boost::spirit::qi;
namespace qr = boost::spirit::repository::qi;

template <typename It>
struct Parser : qi::grammar<It, Ast::Resolves()>
{
    Parser() : Parser::base_type(start) {
        using namespace qi;

        identifier = char_("a-zA-Z_") >> *char_("a-zA-Z0-9_");

        arg        = raw [ +('(' >> -arg >> ')' | +~char_(",)(")) ];
        arglist    = '(' >> -(arg % ',') >> ')';
        resolve    = identifier >> arglist;

        start      = *qr::seek[hold[resolve]];
    }
  private:
    qi::rule<It, Ast::Resolves()> start;
    qi::rule<It, Ast::Resolve()>  resolve;
    qi::rule<It, Ast::ArgList()>  arglist;
    qi::rule<It, std::string()>   arg, identifier;
};

#include <iostream>

int main() {
    using It = std::string::const_iterator;
    std::string const samples = R"--(
Samples:

sometext(para)        → expect para in the string list
sometext(para1,para2) → expect para1 and para2 in string list
sometext(call(a))     → expect call(a) in the string list
sometext(call(a,b))   ← here it fails; it seams that the "!lit(',')" wont make the parser step outside
)--";
    It f = samples.begin(), l = samples.end();

    Ast::Resolves data;
    if (parse(f, l, Parser<It>{}, data)) {
        std::cout << "Parsed " << data.size() << " resolves\n";

    } else {
        std::cout << "Parsing failed\n";
    }

    for (auto& resolve: data) {
        std::cout << " - " << resolve.name << "\n   (\n";
        for (auto& arg : resolve.arglist) {
            std::cout << "       " << arg << "\n";
        }
        std::cout << "   )\n";
    }
}

Prints

Parsed 6 resolves
 - sometext
   (
       para
   )
 - sometext
   (
       para1
       para2
   )
 - sometext
   (
       call(a)
   )
 - call
   (
       a
   )
 - call
   (
       a
       b
   )
 - lit
   (
       '
       '
   )

More Ideas

That last output shows you a problem with your current grammar: lit(',') should obviously not be seen as a call with two parameters.

I recently wrote an answer on extracting (nested) function calls with parameters that does things more neatly:

Boost spirit parse rule is not applied
or this one boost spirit reporting semantic error

BONUS

Bonus version that uses string_view and also shows exact line/column information of all extracted words.

Note that it still doesn't require any phoenix or semantic actions. Instead it simply defines the necessary trait to assign to boost::string_view from an iterator range.