Predictive parser for EBNF productions

Question

I am trying to write a recursive descent parser without backtracking for a kind of EBNF like this:

<a> ::= b [c] | d

where

<a> = non-terminal
lower-case-string = identifier
[term-in-brackets] = term-in-brackets is optional
a|b is the usual mutually exclusive choice between a and b.

For now, I care only about the right hand side.

Following the example at http://en.wikipedia.org/wiki/Recursive_descent_parser, I eventually ended up with the following procedures (rule in GNU bison syntax in comments above):

/* expression: term optional_or_term */
void expression()
{
    term();
    if (sym == OR_SYM)
        optional_or_term();

}

/* optional_or_term: // empty
    | OR_SYM term optional_or_term
*/
void optional_or_term()
{
    while (sym == OR_SYM)
    {
        getsym();
        term();
    }
}

/* term: factor | factor term */
void term()
{
    factor();
    if (sym == EOF_SYM || sym == RIGHT_SQUAREB_SYM)
    {
        ;
    }
    else if (sym == IDENTIFIER_SYM || sym == LEFT_SQUAREB_SYM)
        term();
    else if (sym == OR_SYM)
        optional_or_term();
    else
    {
        error("term: syntax error");
        getsym();
    }

}

/*
factor: IDENTIFIER_SYM  
    | LEFT_SQUAREB_SYM expression RIGHT_SQUAREB_SYM
*/

void factor()
{
    if (accept(IDENTIFIER_SYM))
    {
        ;
    }
    else if (accept(LEFT_SQUAREB_SYM))
    {
        expression();
        expect(RIGHT_SQUAREB_SYM);
    }
    else
    {
        error("factor: syntax error");
        getsym();
    }

}

It seems to be working, but my expectation was that each procedure would correspond closely with the corresponding rule. You will notice that term() does not.

My question is: did the grammar need more transformation before the procedures were written?

Your BNF for term doesn't match the example you say you got it from. Why did you change it? — Ira Baxter, Sep 28 '14 at 17:03
It's _based_ on the example. I am trying to write a parser for an EBNF expression (e.g. `<a> ::= expression`), not an arithmetic expression. In `<a> ::= id1 id2 | id3`, id1, id2 and id3 are identifiers (and factors and terms and expressions). id1 followed by id2 is also a term (a sequence of factors) and an expression, but there is no concatenation operator. Finally, the entire right-hand side is a term | term (i.e., an expression), so that choice has a lower precedence than concatenation. It is written this way because it doesn't work otherwise. — Test User, Sep 30 '14 at 04:14
I am (and I suspect other readers are, see upvote on my comment) unclear on precisely what grammar you think you are trying to implement. Based on your explicit link, a reasonable person would assume you were trying to implement the grammar at the link. If that's not the grammar you intend, we readers are misdirected. Please show the grammar you want to process explicitly. Frankly, none of us are actually interested in the examples, just the grammar, under the assumption it is reasonable. It might be, or might not be. Can't tell from what you told us. — Ira Baxter, Sep 30 '14 at 04:49
FWIW: See my discussion on writing recursive descent parsers. Their structure *does* match that of the rules: http://stackoverflow.com/a/2336769/120163 — Ira Baxter, Sep 30 '14 at 04:50
I am trying to parse a file. The file is not COBOL, FORTRAN (insert your favourite language here), BASIC, ... The file contains a kind of EBNF that looks like `<a> ::= b [c] | d`. I have nothing to add to my initial explanation of these symbols. I have already said that each function has the grammar rule that it implements above it in comments in GNU bison syntax, e.g. `/* expression: term optional_or_term */`. This is what I "think" that I am trying to do. I hope that I'm right. Any reasonable example works for me, since I will have to modify it if it parses a different language, which it does. — Test User, Sep 30 '14 at 11:53
Ok, fine, you are parsing a kind of EBNF. *Write down the grammar that you think you implemented*. — Ira Baxter, Sep 30 '14 at 11:55
I believe (but I haven't tested) that ultimately the problem is that concatenation in EBNF does not have an operator: we write `<a> ::= b c` and not, for example, `<a> ::= b + c`. There is no terminal symbol `+` to start the right hand side of the production. If that is so, my question is: how do I derive a grammar that expresses this fact in a way that will let me code several hundred such rules in a day or two? This, of course, boils down to my original question. — Test User, Sep 30 '14 at 12:24
expression: term optional_or_term; optional_or_term: // empty | OR_SYM term optional_or_term; term: factor | factor term; factor: IDENTIFIER_SYM | LEFT_SQUAREB_SYM expression RIGHT_SQUAREB_SYM; I don't know how to get line breaks, so each rule is terminated with a semi-colon. — Test User, Sep 30 '14 at 12:31
I did, but I did not see anything that addresses my issue, which is: Is my grammar in the correct form that will let me write a predictive parser that follows mechanically from the grammar, whether I follow your example or another one? Nevertheless, I will re-implement my parser using your guide and I will see what happens. — Test User, Sep 30 '14 at 14:35

score 1 · Accepted Answer · edited May 23 '17 at 12:21

I don't think your problem is the absence of operators for concatenation. I think it is not using Kleene star (and plus) for lists of things. The Kleene star lets you actually code a loop inside a procedure that implements the grammar rule.

I would have written your grammar as:

expression = term (OR_SYM term)*;
term = factor+;
factor = IDENTIFIER_SYM | LEFT_SQUAREB_SYM expression RIGHT_SQUAREB_SYM ;

(This is a pretty classic grammar for a grammar).

The parser code then looks like:

 boolean expression()
 {   if term()
     {   loop
         {  if OR_SYM()
            {  if term()
               {}
               else syntax_error();
            }
            else return true;
         }
     }
     else return false;
 }

 boolean term()
 {  if factor()
    {  loop
       {  if factor()
          {}
          else return true;
       }
    }
    else return false;
 }

 boolean factor()
 {  if IDENTIFIER_SYM()
       return true;
    else
    {  if LEFT_SQUAREB_SYM()
       {  if expression()
          {  if RIGHT_SQUAREB_SYM()
                return true;
             else syntax_error();
          }
          else syntax_error();
       }
       else return false;
    }
 }

I tried to generate this in an absolutely mechanical way, and you can do pretty well like this. I did a lot of this earlier in my career.

What you're not going to get is 150 working rules per day. First, for a big language, it is hard to get the grammar right; you'll be tweaking it repeatedly to get a grammar that works in the abstract, and then you have to adjust the code you wrote. Next you'll discover that writing the lexer has its troubles too; just try writing a lexer for Java. Finally, you'll discover that parser rules aren't the whole game or even the biggest part of your effort; you need a lot more to process real code. I call this "Life After Parsing"; see my bio for more information.

If you want to get 150 working rules per day, switch to a GLR parser and stop coding parsers manually. That won't address the other issues, but it does make you incredibly productive at getting out a usable grammar. This is what I do now. Without exception. (Our DMS Software Reengineering Toolkit uses this, and we parse a lot of things that people claim are hard.)

So yes, I needed to rewrite my grammar using extensions that I was never taught and use more looping and less recursion. Seriously, what the looping does is make it more convenient to ignore characters that are not wanted _at that moment_. While parsing `[a]`, `a` is recognised as a factor, and I try to gather more factors by calling `factor` recursively. The next character is a ']', which should be ignored at this point so that the calling `factor` which is parsing `'[' expression ']'` can deal with it. I can't ignore it generally, because ']' is not a factor and cannot start a factor. — Test User, Sep 30 '14 at 16:50
The predicates that check if characters are in the input stream need to be able to back up if what they want is not there. You'll have additional trouble with whitespace: where is that consumed? (I've assumed it is automatically consumed by all character-checking routines. That may not be what you want, esp. if you want to keep comments). Your best bet is to go build a couple of parsers like this to get a feel for strengths (sort of easy to code) vs. weaknesses. In the long term, you'll find using parser generators is easier, because they are stronger and handle more general cases. — Ira Baxter, Sep 30 '14 at 17:04
.... I built my first RD parser back in 1969. I still build them for very small grammars when I'm in a very big hurry. But 40 years of experience says, "Use a parser generator". You choose. — Ira Baxter, Sep 30 '14 at 17:06
Whitespace and comments are stripped out by my lexer. I have a full working version in bison/flex, a recursive descent with backtracking "recogniser" (just tells whether the input is valid or not) and now I am writing a predictive "recogniser". I am not entering the compiler-writing business. I just want to be able to write a parser if I _must_ use a particular language and there is no free (as in free beer) parser generator available, or I am disinclined to learn it for one-time use. I think I'll look back at my original predictive implementation. I probably need to revisit my error handling. — Test User, Sep 30 '14 at 18:37
