Grammar and top down parsing

Question

For a while I've been trying to learn how LL parsers work and, if I understand this correctly, when writing a top-down recursive descent parser by hand I should create a function for every non-terminal symbol. So for this example:

S -> AB
A -> aA|ε
B -> bg|dDe
D -> ab|cd

I'd have to create a function for each S, A, B and D like this:

Token B()
{
    if (Lookahead(1) == 'b')
    {
        Eat('b');
        Eat('g');
    }
    else if (Lookahead(1) == 'd')
    {
        Eat('d');
        D();
        Eat('e');
    }
    else
    {
        Error();
    }

    return B_TOKEN;
}

But then I try to do the same thing with the following grammar that I created to generate the same language as (a|b|c)*a regular expression:

S -> Ma
M -> aM|bM|cM|ε

That gives me following functions:

Token S()
{
    char Ch = Lookahead(1);
    if (Ch == 'a' || Ch == 'b' || Ch == 'c')
    {
        M();
        Eat('a');
    }
    else
    {
        Error();
    }

    return S_TOKEN;
}

Token M()
{
    char Ch = Lookahead(1);
    if (Ch == 'a' || Ch == 'b' || Ch == 'c')
    {
        Eat(ch);
        M();
    }

    return M_TOKEN;
}

And it doesn't seem to be good because for input 'bbcaa' M will consume everything, and after that S won't find the last 'a' and report an error, even though the grammar accepts it. It feels like M is missing the ε case, or maybe it is handled in the wrong way, but I'm not sure how to deal with it. Could anyone please help?

@Downvoter: This isn't about a particular language. He doesn't need a tag. — Ira Baxter, Feb 10 '17 at 20:10
Well, your S function should call M, check for acceptance, then check for a, but that's not what you wrote. See my SO answer on how to write a recursive descent parser: http://stackoverflow.com/questions/2245962/is-there-an-alternative-for-flex-bison-that-is-usable-on-8-bit-embedded-systems/2336769#2336769 — Ira Baxter, Feb 10 '17 at 20:12
@IraBaxter The OP talks about behavior he observed, which makes it seem like he has tested actual code, in which case a language tag would be totally appropriate. If "[...] M is missing the ε case [...]" is not about a flaw in the implementation but about a parsing technicality, this question is off-topic anyway and better suited on [Computer Science](https://cs.stackexchange.com). — cadaniluk, Feb 10 '17 at 20:18

rici · Accepted Answer · 2017-02-10T20:51:36.703

The behaviour of a top-down predictive parser is exactly as you note in your question. In other words, your second grammar is not suitable for top-down parsing (with one-token lookahead). Many grammars are not; that includes any grammar in which it is not possible to predict which production to use based on a finite lookahead.

In this case, if you were to lookahead two tokens instead of one, you could solve the conflict; M should predict the ε production on lookahead aEND, and the aM production on all other two-token lookaheads where the first token is a.

Grammar and top down parsing

1 Answers1