How to write a recursive descent parser?

Question

C++ I can't figure out how to write a recursive descent parser for the following grammar:

<Datalog Program>     ->  Schemes : <Scheme List>     
                         Facts   : <Fact List>
                         Rules   : <Rule List>
                         Queries : <Query List>
                         <EOF>
<Scheme List>         ->  <Scheme> <Scheme List Tail>
<Scheme List Tail>    ->  <Scheme List> | ε
<Scheme>              ->  <Identifier> ( <Identifier List> )
<Identifier List>     ->  <Identifier> <Identifier List Tail>
<Identifier List Tail>->  , <Identifier List> | ε
<Fact List>           ->  <Fact> <Fact List> | ε
<Fact>                ->  <Identifier> ( <Constant List> ) .
<Constant List>       ->  <String> <Constant List Tail>
<Constant List Tail>  ->  , <Constant List> | ε
<Rule List>           ->  <Rule> <Rule List> | ε
<Rule>                ->  <Head Predicate> :- <Predicate List> .
<Head Predicate>      ->  <Identifier> ( <Identifier List> )
<Predicate List>      ->  <Predicate> <Predicate List Tail>
<Predicate List Tail> ->  , <Predicate List> | ε
<Predicate>           ->  <Identifier> ( <Parameter List> )
<Parameter List>      ->  <Parameter> <Parameter List Tail>
<Parameter List Tail> ->  , <Parameter List> | ε
<Parameter>           ->  <String> | <Identifier> | <Expression>
<Expression>          -> ( <Parameter> <Operator> <Parameter> )
<Operator>            -> + | *
<Query List>          ->  <Query> <Query List Tail>
<Query List Tail>     ->  <Query List> | ε
<Query>               ->  <Predicate> ?

This is a simple datalog-like grammar. I'm absolutely lost in trying to write the parser. I've already written a Lexical Analyzer that outputs a vector with all the tokens. I know that I need to write methods for each production, but I can't figure out how to connect the tokens into a parse tree (so I can run a toString() function after the tree is complete). I need to be pointed into the right direction. Thanks.

There are many texts on the matter, it is treated e.g. in the mythical dragon book, Aho, Sethi, Ullman's "Compilers: Principles, techniques and tools". IIRC in Wirth's "Algorithms + Data Structures = Programs" (it might have been another book of his) there is a _very_ lucid explanation on building a recursive descent parser. — vonbrand, Feb 10 '13 at 03:22
Dumb question: Can't you use a tool like bison or byacc? It is much easier to write/maintain the parser's source code that way. Those (and other) tools give standalone C programs (or C++ too for bison). — vonbrand, Feb 10 '13 at 03:24
Honestly, getting the grammar properly factored and removing left-recursion is the hard part, and it appears you're already there. This thing should damn-near write *itself* at this point. Assuming you have a proper lexer, you can think of every production as a function to be called, sprinkled with token-fetches to validate the next production path along the way. Make the proper decisions, and do *not* take shortcuts until you have it working fully *first*. — WhozCraig, Feb 10 '13 at 03:27
@WhozCraig I know most of the work is done at this point - but I can't wrap my head around the recursive descent parser. What happens inside of the production methods? How is the full tree built and connected? — amorimluc, Feb 10 '13 at 03:30
The way I've done it is to use a global stack of parse nodes. Each parse method that calls other parse methods should always expect those calls to produce output on the stack. What the output forms is dependent on what was called. Tokens almost always go on the stack. When a production method is finished, it pops what was produced *for it* off the stack, gathering them into its own parse node, and pushes it onto the stack. When finished, a successful parse should have *one* node on the stack. the one that started it all; the rest are treed up from there. Hope that made sense. — WhozCraig, Feb 10 '13 at 03:40
@WhozCraig You have just described a weird combination of Knuth LR style bottom-up parsing with hand coding. Not recursive descent. — user207421, Feb 10 '13 at 22:09
@EJP If it wasn't clear, I code the tree construction *in* the RDP. As each parse member needs to, it pulls results from its calls from the stack. I assure you there is plenty of recursive descent in how I do it. The best on-web description I could find on how I normally do it starts on slide #7 of [this powerpoint deck](http://www.cis.upenn.edu/~matuszek/cit594-2005/Lectures/25-recursive-descent-parsing.ppt). Never considered it "weird", but perhaps the definitions of top-down vs. bottom-up changed over the few decades I've been at this. — WhozCraig, Feb 11 '13 at 00:56
See http://stackoverflow.com/a/2336769/120163 for a discussion of writing RDP, with auxialiary discussion on how to build a tree as you parse. — Ira Baxter, Nov 30 '15 at 14:44

score 2 · Accepted Answer · answered Feb 10 '13 at 04:29

Given what you've already written, it's mostly a mechanical translation from EBNF to C++ source code, so (for example), a rule like:

<Scheme>              ->  <Identifier> ( <Identifier List> ) 
<Identifier List>     ->  <Identifier> <Identifier List Tail>
<Identifier List Tail>->  , <Identifier List> | ε

Translates to something on this general order (though be careful -- I'm just typing this off the top of my head, so it hasn't been tested at all; think of it more as C-like pseudo-code than anything to compile as-is).

bool scheme() { 
    return identifier()        &&
           check_input('(')    &&
           identifier_list()   &&
           check_input(')');
}

bool identifier_list() { 
    if (identifier()) {
        if (check_input(','))
            return identifier_list();
        else
            return true;
    }
    return false;
}

bool identifier() { 
    // you haven't said what's a legal identifier, so for the moment I'm assuming
    // rules roughly like C's.
    int ch;
    std::string id;

    if (!((ch=getinput()) == '_' || isalapha(ch))
        return false;

    do { 
        id += ch;
        ch = getinput();
    } while (isalpha(ch) || isdigit(ch) || ch == '_');
    putback(ch);
    return true;
}

Once you have it recognizing the inputs correctly (and building the identifiers and such, as I've done above, even though I've done nothing with it) you can start to build a parse tree (or whatever) adding things to the tree as you recognize them. If you're doing that in C, you usually use a union. In C++, you might want to use a class hierarchy instead (though it'll be a pretty short, bushy hierarchy -- in fact, could easily be all the others derived directly from the base class).

score 1 · Answer 2 · answered Feb 10 '13 at 04:54

It's not too easy to see what's going up in your grammar, at least if one doesn't know Datalog, or whatever the language is called.

It would have been a lot easier - not only to read for the unknowing, but also for yourself if you had used different Markup for Terminals and Productions. It is sure possible to figure out which one one is which by hunting for-and backward but it seems to make me a little dizzy.

Taking a look at your grammar it doesn't seem you'll end up with a rather complex tree. As all items are ordered.

My suggestion would be to just model the tree according to your Grammmar i.e. datastructs for: DatalogProgram Schmema Fact Rule and Query as well as Expression HeadPredicate and Predicate

This way all you'd have to do to make your Parser work would look something like

DatalogProgram parse_datalog_progrem (pos){
   DatalogProgram program = new

   while ( Scheme s = parse_scheme () )
       program.append_scheme (s);

   while ( Fact f = parse_fact () )
       program.append_scheme (f);

   while ( Rule r = parse_rule () )
       program.append_rule (r);

   while ( Query q = parse_query () )
       program.append_query (q);

   return program; 
}

Scheme parse_scheme () {
   Scheme s = new ...
   s.id = parse_identifier ();

   parse ('(')

   while (Identifier i = parse_identifier) {
     s.append_id_list (i);
     if (lookahead != ',')
        break;
     parse (',');
   }
   parse (')');
   return s;
}

....

... and so on. I think you get the point.

There still seems to be one or the other thing to work around, but I think this is the best way to go. You e.g. might want to think about using a union in a datastruct like

struct Parameter {
  enum ParamType type;
  union {
     String  str;
     Identifier id;
     Expression expr;
  }
}

It's still a long long way to go from here, but I hope you like the direction.

.. oh and DatalogProgram something like:

struct DatalogProgram {
  struct Scheme *schemes;
  struct Fact *facts;
  struct Rule *rules;
  struct Query *queries;
}

for the root of your parse tree.

How to write a recursive descent parser?

2 Answers2