Why does parsing this program with BNFC fail?

Question

Given the following grammar:

comment "/*" "*/" ;

TInt.  Type1 ::= "int" ;
TBool. Type1 ::= "bool" ;
coercions Type 1 ;


BTrue.   BExp   ::= "true" ;
BFalse.  BExp   ::= "false" ;

EOr.     Exp    ::= Exp  "||" Exp1 ;
EAnd.    Exp1   ::= Exp1 "&&" Exp2 ;
EEq.     Exp2   ::= Exp2 "==" Exp3 ;
ENeq.    Exp2   ::= Exp2 "!=" Exp3 ;
ELt.     Exp3   ::= Exp3 "<" Exp4 ;
EGt.     Exp3   ::= Exp3 ">" Exp4 ;
ELte.    Exp3   ::= Exp3 "<=" Exp4 ;
EGte.    Exp3   ::= Exp3 ">=" Exp4 ;
EAdd.    Exp4   ::= Exp4 "+" Exp5 ;
ESub.    Exp4   ::= Exp4 "-" Exp5 ;
EMul.    Exp5   ::= Exp5 "*" Exp6 ;
EDiv.    Exp5   ::= Exp5 "/" Exp6 ;
EMod.    Exp5   ::= Exp5 "%" Exp6 ;
ENot.    Exp6   ::= "!" Exp ;
EVar.    Exp8   ::= Ident ;
EInt.    Exp8   ::= Integer ;
EBool.   Exp8   ::= BExp ;
EIver.   Exp8   ::= "[" Exp "]" ;
coercions Exp 8 ;

Decl. Decl ::= Ident ":" Type ;
terminator Decl ";" ;


LIdent.  Lvalue ::= Ident ;

SBlock.  Stm ::= "{" [Decl] [Stm] "}" ;
SExp.    Stm ::= Exp ";" ;
SWhile.  Stm ::= "while" "(" Exp ")" Stm ;
SReturn. Stm ::= "return" Exp ";" ;
SAssign. Stm ::= Lvalue "=" Exp ";" ;
SPrint.  Stm ::= "print" Exp ";" ;
SIf.     Stm ::= "if" "(" Exp ")" "then" Stm "endif" ;
SIfElse. Stm ::= "if" "(" Exp ")" "then" Stm "else" Stm "endif" ;

terminator Stm "" ;

entrypoints Stm;

the parser created with BNFC fails to parse

{ c = a; }

although it parses

c = a;

or

{ print a; c = a; }

I think the problem could be that the parser sees an Ident and doesn't know whether it starts a declaration or a statement (LR stuff, etc.), although one token of lookahead should still be enough? However, I couldn't find any note in the BNFC documentation saying that it doesn't work for all grammars.

Any ideas how to get this working?

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

I would think you would get a shift/reduce conflict report for that grammar, although where that error message shows up might well depend on which tool BNFC is using to generate the parser. As far as I know, all the backend tools have the same approach to dealing with shift/reduce conflicts, which is to (1) warn the user about the conflict, and then (2) resolve the conflict in favour of shifting.

The problematic production is this one (I've left out the rule label to reduce clutter):

Stm ::= "{" [Decl] [Stm] "}" ;

Here, [Decl] and [Stm] are macros, which automatically produce definitions for the non-terminals with those names (or something equivalent which will be accepted by the backend tool). Specifically, the automatically-produced productions are:

[Decl] ::= /* empty */
       |   Decl ';' [Decl]

[Stm]  ::= /* empty */
       |   Stm [Stm]

(The ; in the first rule is the result of a "terminator" declaration. I don't know why BNFC generates right-recursive rules, but that's how I interpret the reference manual -- after a very quick glance -- and I'm sure they have their reasons. For the purpose of this problem, it doesn't matter.)
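
For reference, the same expansion can be written directly in LBNF. As far as I read the LBNF reference, a terminator declaration is shorthand for a pair of list rules using the predefined labels [] and (:). The sketch below spells them out for this grammar; it is meant to replace the two terminator lines, not to sit alongside them.

```lbnf
-- Equivalent of: terminator Decl ";" ;
[].  [Decl] ::= ;
(:). [Decl] ::= Decl ";" [Decl] ;

-- Equivalent of: terminator Stm "" ;
[].  [Stm]  ::= ;
(:). [Stm]  ::= Stm [Stm] ;
```

Written this way, the empty [Decl] alternative is explicit, and that is the production the parser has to reduce before it can start reading a statement.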

What's important is that both Decl and Stm can start with an Ident. So let's suppose we're parsing { id ..., which might be { id : ... or { id = ..., but we've only read the { and the lookahead token id. So there are two possibilities:

1. id is the start of a Decl. We should shift the Ident and go to the state which includes Decl → Ident • ':' Type
2. id is the start of a Stm. In this case, we need to reduce the production [Decl] → • before we shift Ident into a Stm production.

So we have a shift/reduce conflict, because we cannot see the second next token (either : or =). And, as mentioned above, shift usually wins in this case, so the LR(1) parser will commit itself to expect a Decl. Consequently, { a = b ; } will fail.
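
The conflict does not depend on the expression rules, so it can be isolated in a much smaller grammar. The following is a hypothetical reduction, not the grammar from the question: the assignment and declaration rules are simplified stand-ins, kept only so that both a Decl and a Stm can begin with an Ident.

```lbnf
-- Hypothetical minimal reproduction of the conflict.
SBlock.  Stm  ::= "{" [Decl] [Stm] "}" ;
SAssign. Stm  ::= Ident "=" Ident ";" ;
Decl.    Decl ::= Ident ":" "int" ;

terminator Decl ";" ;
terminator Stm "" ;

entrypoints Stm ;
```

With a grammar like this, I would expect the backend parser generator to warn about a shift/reduce conflict in the state reached after "{", and the resulting parser to accept c = a; on its own while rejecting { c = a; }, which is the symptom described in the question.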

An LR(2) parser generator would do fine with this grammar, but those are much harder to find. (Modern bison can produce GLR parsers, which are even more powerful than LR(2) at the cost of a bit of extra compute time, but not the version required by the BNFC tool.)

Possible solutions

1. Allow declarations to be intermingled with statements. This one is my preference. It is simple, and many programmers expect to be able to declare a variable at first use rather than at the beginning of the enclosing block.
2. Make the declaration recognizable from the first token, either by putting the type first (as in C) or by adding a keyword such as var (as in JavaScript).
3. Modify the grammar to defer the lookahead. It is always possible to find an LR(1) grammar for any LR(k) language (provided k is finite), but it can be tedious. An ugly but effective alternative is to continue the lexical scan until either a : or some other non-whitespace character is found, so that id : gets tokenized as IdentDefine or some such. (This is the solution used by bison, as it happens. It means that you can't put comments between an identifier and the following :, but there are few, if any, good reasons to put a comment in that context.)
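
As a sketch of how the keyword approach and the intermingling approach might look in LBNF, assuming the rest of the grammar from the question stays unchanged: only the rules shown here are affected, and SDecl is a hypothetical label introduced for illustration.

```lbnf
-- Keyword approach: every declaration starts with "var", so a single
-- token of lookahead separates declarations from statements.
Decl.   Decl ::= "var" Ident ":" Type ;
terminator Decl ";" ;
SBlock. Stm  ::= "{" [Decl] [Stm] "}" ;

-- Intermingling approach (use INSTEAD of the three rules above):
-- a declaration becomes just another statement and the [Decl] list
-- disappears, so nothing has to be reduced before the Ident is shifted.
-- Decl.   Decl ::= Ident ":" Type ;
-- SDecl.  Stm  ::= Decl ";" ;
-- SBlock. Stm  ::= "{" [Stm] "}" ;
```

With the keyword approach, a block with a declaration is written { var c : int; c = a; }, and var becomes a reserved word that can no longer be used as an identifier. With the intermingling approach, the parser can shift the Ident first and decide from the next token (: for a declaration, = for an assignment, anything else for an expression), which should remove this particular conflict.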

Thank you for this precise answer. Now I am sure I understand what shift and reduce mean. I guess I will just choose other terminator... or put the type first in declaration. — kitek, May 17 '15 at 11:44
@kitek: The terminator is irrelevant here. Putting the type first would work if you can distinguish types from identifiers in the lexical analysis. Alternatively, you could allow declarations and statements to be freely intermingled instead of insisting that declarations come first. (That would be my preference and perhaps I should have included that in the answer.) — rici, May 17 '15 at 16:10
You are right. I decided to start each declaration with "var", so no ambiguity possible. — kitek, May 17 '15 at 16:38
@kitek: That's another solution. It's not really an ambiguity; the grammar is unambiguous, but it requires more than one token of lookahead. In fact, it is possible, though tedious, to create an LR(1) grammar for the language, but I don't believe BNFC will help you there. — rici, May 17 '15 at 16:42
You are right once again, I ment "reduce shift conflict". I know, that I could achieve it by introducing rule DecideIdentLater ::= Ident Rest, Rest ::= ..., but it appears much easier to just add "var" :). — kitek, May 17 '15 at 16:51
@kitek: Indeed. Edited the answer to include this discussion. — rici, May 17 '15 at 17:02
