Optimizing a boost::spirit::qi parser

Question

I have a parser that basically prints out the actions of a stack machine with my operator precedence given some expression. My goal is to optimize for speed as much as possible. I have read an article concerning qi optimizations that provides this example code. I understand the gist of the optimizations described in the main article, however I am unclear how to integrate this into my code.

Here is a following working example of my parser. I have already tried to optimize it somewhat by using raw[] to provide base iterators. The phoenix action calls must be given strings or iterators by which they can create strings; the real versions of these functions are not trivial and their functionality can not yet be evaluated in parsing-time:

#include <iostream>
#include <vector>
#include <string>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_char.hpp>
#include <boost/spirit/include/qi_parse.hpp>
#include <boost/spirit/include/phoenix_bind.hpp>
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;
using std::endl;
using std::cout;
using std::string;
using std::vector;

void fPushOp(const char* op){
  cout << "PushOp: " << op << endl;
}

void fPushInt(const boost::iterator_range<string::const_iterator>& my_str){
  cout << "PushInt: " << my_str << endl;
}

template<typename Iterator, typename Skipper = qi::space_type>
struct Calculator : public qi::grammar<Iterator, Skipper> {

  qi::rule<Iterator, Skipper>  
    expression, logical_or_expression, logical_and_expression, negate_expression, series_expression,
    single_expression, inclusive_or_expression, exclusive_or_expression, and_expression, equality_expression, 
    relational_expression, shift_expression, additive_expression, multiplicative_expression, 
    term, complement_factor, factor, result, integer, variable, variable_combo, word, prefix;

  qi::rule<Iterator> number;
  Calculator() : Calculator::base_type(result)
  {
    number = 
        qi::raw[
            ("0x" >> +qi::char_("0-9a-fA-F"))     
          | ("0b" >> +qi::char_("0-1"))
          | ("0" >>  +qi::char_("0-7"))
          | (+qi::char_("0-9"))
        ] [phx::bind(&fPushInt, qi::_1)]
        ;

    integer = 
          number
        | ('-' >> number) [phx::bind(&fPushOp, "OP_UNARY_MINUS")]
        ;

    variable =
          ((qi::alpha | qi::char_('_')) 
              >> *(qi::alnum | qi::char_('_')) 
              >> '['
              >>  (+(qi::alnum | qi::char_('_') | qi::char_(',')) 
                | ('\'' >> *~qi::char_('\'') >> '\'')) 
              >> ']')
        | ((qi::alpha | qi::char_('_')) >> *(qi::alnum | qi::char_('_')))
        ;

    variable_combo =
        qi::raw [
          variable >> *(qi::char_('.') >> variable)
        ] [phx::bind(&fPushInt, qi::_1)]
        ;

    word = 
        qi::raw[
          variable
        ] [phx::bind(&fPushInt, qi::_1)]
        ;

    factor =
            ("ceil(" >> expression >> ')')                                                      [phx::bind(&fPushOp, "OP_CEIL")]
        |   ("wrap(" >> expression >> ')')                                                      [phx::bind(&fPushOp, "OP_WRAP")]
        |   ("abs(" >> expression >> ')')                                                       [phx::bind(&fPushOp, "OP_ABS")]
        |   ("count1(" >> expression >> ')')                                                    [phx::bind(&fPushOp, "OP_COUNT1")]
        |   ("pick(" >> expression >> ')')                                                      [phx::bind(&fPushOp, "OP_PICK")]
        |   ("defined(" >> expression >> ')')                                                   [phx::bind(&fPushOp, "OP_DEF")]
        |   ("string_equal(" >> word >> ',' >> word >> ')')                                     [phx::bind(&fPushOp, "OP_STREQ")]
        |   ("string_contains(" >> word >> ',' >> word >> ')')                                  [phx::bind(&fPushOp, "OP_STRCON")]
        |   ("lsl(" >> single_expression >> ',' >> single_expression >> ',' >> number >> ')')   [phx::bind(&fPushOp, "OP_LSL")]
        |   ("lsr(" >> single_expression >> ',' >> single_expression >> ')')                    [phx::bind(&fPushOp, "OP_LSR")]
        |   ("asr(" >> single_expression >> ',' >> single_expression >> ',' >> number >> ')')   [phx::bind(&fPushOp, "OP_ASR")]
        |   ("ror(" >> single_expression >> ',' >> single_expression >> ',' >> number >> ')')   [phx::bind(&fPushOp, "OP_ROR")]
        |   ("rrx(" >> single_expression >> ',' >> single_expression >> ',' >> single_expression >> ',' >> number >> ')')[phx::bind(&fPushOp, "OP_RRX")]
        |   ('(' >> expression >> ')')
        |   variable_combo
        |   integer
        ;
    complement_factor = factor
        | ('~' >> factor) [phx::bind(&fPushOp, "OP_COMPLEMENT")]
        ;
    term = complement_factor
      >> *( (".." >> complement_factor) [phx::bind(&fPushOp, "OP_LEGER")]
          | ('\\' >> complement_factor) [phx::bind(&fPushOp, "OP_MASK")]
          ); 
    multiplicative_expression = term
      >> *( ('/' >> term) [phx::bind(&fPushOp, "OP_DIV")]
          | ('%' >> term) [phx::bind(&fPushOp, "OP_MOD")]
          | ('*' >> term) [phx::bind(&fPushOp, "OP_MUL")]
          );
    additive_expression = multiplicative_expression
      >> *( ('+' >> multiplicative_expression)  [phx::bind(&fPushOp, "OP_ADD")]
          | ('-' >> multiplicative_expression)  [phx::bind(&fPushOp, "OP_SUB")]
          );
    shift_expression = additive_expression
      >> *( (">>" >> additive_expression) [phx::bind(&fPushOp, "OP_SRL")]
          | ("<<" >> additive_expression) [phx::bind(&fPushOp, "OP_SLL")]
          );
    relational_expression = shift_expression
      >> *( ('<' >> shift_expression) [phx::bind(&fPushOp, "OP_LT")]
          | ('>' >> shift_expression) [phx::bind(&fPushOp, "OP_GT")]
          | ("<=" >> shift_expression)[phx::bind(&fPushOp, "OP_LET")]
          | (">=" >> shift_expression)[phx::bind(&fPushOp, "OP_GET")]
          );
    equality_expression = relational_expression 
      >> *( ("==" >> relational_expression)[phx::bind(&fPushOp, "OP_EQ")]
          | ("!=" >> relational_expression)[phx::bind(&fPushOp, "OP_NEQ")] 
          );
    and_expression = equality_expression 
      >> *(('&' >> equality_expression)     [phx::bind(&fPushOp, "OP_AND")]); 
    exclusive_or_expression = and_expression 
      >> *(('^' >> and_expression)          [phx::bind(&fPushOp, "OP_XOR")]); 
    inclusive_or_expression = exclusive_or_expression 
      >> *(('|' >> exclusive_or_expression) [phx::bind(&fPushOp, "OP_OR")]); 
    single_expression = inclusive_or_expression;
    series_expression = inclusive_or_expression 
      >> *((',' >> inclusive_or_expression) [phx::bind(&fPushOp, "OP_SERIES")]);
    negate_expression = series_expression
        | ('!' >> series_expression)        [phx::bind(&fPushOp, "OP_NEGATE")];
    logical_and_expression = negate_expression
      >> *(("&&" >> negate_expression)      [phx::bind(&fPushOp, "OP_LOGICAL_AND")]); 
    logical_or_expression = logical_and_expression 
      >> *(("||" >> logical_and_expression) [phx::bind(&fPushOp, "OP_LOGICAL_OR")]);
    expression = logical_or_expression;

    result = expression;
  }
};

int main(){
  Calculator<string::const_iterator> calc;
  const string expr("0xff0000 >> 3 && 3 + (!9) | (0,200)");
  cout << "Expression: " << expr << endl;

  string::const_iterator it = expr.begin();
  phrase_parse(it, expr.end(), calc, qi::space);

  cout << "Remaining: " << (string(it,expr.end())) << endl;
  return 0;
}

Additionally, I read the slides from this pdf concerning utree and am trying to figure out how, if possible, to use the utree output instead of semantic actions since apparently such things are evil. Can someone provide or point me to a simple example on how to construct a utree that can then be fed to a stack machine to print out operations in order?

I'd recommend against `utree` as it never got fully supported and it looks like it never will (because there's little gain). Also, I don't anticipate performance win. The parsing might be fast, but you'll need to postprocess to get something done. — sehe, Aug 28 '14 at 07:15
It looks like you just added the Skipper to all the rules, which is likely not what you want (you don't want `foo_ bar` to parse as a valid `variable`). If you look at my previous answer, you'll see that I tried to make a quick separation of rules which require the skipper and those that don't. Also look at `lexeme[]` for exceptions inside parser expressions — sehe, Aug 28 '14 at 07:15
If you want to optimize for speed "as much as possible", hand roll the parser. Also, where is the bottleneck? It looks like you're just printing bits of the input, but that is likely not your real goal, so it would seem silly to optimize for printing the input instead. — sehe, Aug 28 '14 at 07:19

score 1 · Answer 1 · edited May 23 '17 at 11:50

The optimizations depend on what you want to achieve. Therefore, I think you're optimizing prematurely.

E.g. parsing a variable_combo as a raw[] input sequence does not make any sense if you want to interpret the symbols later (because you'll have to parse the variable combo again, and you'll even have to anticipate whitespace in there: "foo . bar .tux" is a valid variable combo here).

I have quite a lot of posts in general dealing with optimizing Boost Spirit (start here e.g.). Quick observations here:

consider correctness under backtracking; with your grammar parsing 'ceil(3.7') you'll get:
```
Expression: ceil(3.7)
PushInt: 3
PushInt: ceil
Remaining: (3.7)
```
Note how this emits opcodes when parsing failed. Note also, it emits the wrong opcodes
- it pushes 3 instead of 3.7
- it pushes ceil as an PushInt?
So not only does it detect failure to parse too late, it just ignores the parentheses, fails to spot the function call and parses the number wrong.

Regarding the premature evaluation, I'm going to point to this popular answer: Boost Spirit: "Semantic actions are evil"?

Other than that, I'm just confirming my suspicion that you're doing premature optimization. Consider doing
```
#define BOOST_SPIRIT_DEBUG
```
and then later, in the grammar constructor:
```
BOOST_SPIRIT_DEBUG_NODES(
        (expression)(logical_or_expression)(logical_and_expression)(negate_expression)(series_expression)(single_expression)
        (inclusive_or_expression)(exclusive_or_expression)(and_expression)(equality_expression)(relational_expression)
        (shift_expression)(additive_expression)(multiplicative_expression)(term)(complement_factor)(factor)(result)(integer)
        (variable)(variable_combo)(word)(prefix)
```
To really see how your parser behaves.

consider qi::symbols e.g.:

qi::symbols<char,const char*> unary_function;

unary_function.add
    ("ceil",    "OP_CEIL")
    ("wrap",    "OP_WRAP")
    ("abs",     "OP_ABS")
    ("count1",  "OP_COUNT1")
    ("pick",    "OP_PICK")
    ("defined", "OP_DEF");

unary_call = (unary_function >> "(" >> expression >> ')') [phx::bind(&fPushOp, qi::_1)];

traits might leave more potential for the compiler to optimize after inlining (as opposed to semantic actions, since the many levels of template instantiation can obscure some cases, especially when bind is involved)

You may want to make operator precedence table driven here, as some of the spirit samples show. The traditional way to use rule-hierarchy to enforce precedence that you've taken complicates the grammar. This has two key downsides:

each rule introduces a virtual dispatch (Spirit X3 may not have this limitation anymore in the future)
your grammar got so complicated that you lost the overview already (see first bullet)

Recommendations

I'd suggest

moving away from evaluating during parsing as the semantic actions grow unwieldy, and are very (very) tricky to get right in the face of (late) backtracking (or even parser failures; the latter could be detected easily, but backtracking can also be benign and very hard to correct for when semantic actions have side effects).
start building the grammar from the simplest rule, gradually building it as you add test cases

The reason that your non-integer number doesn't parse is that only integer numbers are allowed for our purposes. I just upgraded our parser from `boost::spirit::classic` to `boost::spirit::qi` only to find that performance actually dropped. I applied a very trivial optimization that could been done to the original code, and now I have equivalent performance. According to `valgrind`, this is the bottleneck of our system by quite a bit. Any improvements here are worth it. Lemme peruse through your answer several times before commenting more. — Suedocode, Aug 28 '14 at 14:20
@Aggieboy ok, I assumed that `ceil(x)` would not make much sense for integer x. I'm happy to look at a larger codebase, if you can share it — sehe, Aug 28 '14 at 14:22
`ceil(x)` is used to evaluate an expression such as `ceil(3/5)`. It just changes how the final value is rounded. The point of this parser is to generate an AST which may contain variables in the form of strings. The ASTs in some form are then fed to a SAT solver that tries to generate solutions that satisfies all AST equations. This is where my interest in `utree` also comes from. In short, it is not possible for us to evaluate everything on the spot with attributes because we don't actually know the value of everything yet. — Suedocode, Aug 28 '14 at 14:32
This is exactly why I think your car would be very very well served by moving away from semantic actions. If you mind sharing some more chose (particularly about the destination AST) I'd be happy to give it a look — sehe, Aug 28 '14 at 14:40
Allow me a little while to sift through some of our code for relevant pieces and put together an example. I'll edit my OP and comment again when I update. (Thank you again for all of your help. I figured I'd just slap that in there real quick lol) — Suedocode, Aug 28 '14 at 14:47
For several reasons I think it might be appropriate to open a new question. — sehe, Aug 28 '14 at 15:07

Optimizing a boost::spirit::qi parser

1 Answers1

Recommendations

Linked