reduce/reduce conflict in bison grammar

Question

The following bison grammar yields a reduce/reduce conflict :

%token LEFT_PARENTHESIS
%token RIGHT_PARENTHESIS
%token NAME
%token RELATION_INFIX

%%

formula:
  term RELATION_INFIX term 
| NAME
| LEFT_PARENTHESIS formula RIGHT_PARENTHESIS
;

term:
  NAME
| LEFT_PARENTHESIS term RIGHT_PARENTHESIS
;

I do not see where the ambiguity is. LEFT_PARENTHESIS NAME RIGHT_PARENTHESIS must be a formula, because there is no RELATION_INFIX.

The beginning of bison's verbose output is

State 0

    0 $accept: . formula $end

    LEFT_PARENTHESIS  shift, and go to state 1
    NAME              shift, and go to state 2

    formula  go to state 3
    term     go to state 4


State 1

    3 formula: LEFT_PARENTHESIS . formula RIGHT_PARENTHESIS
    5 term: LEFT_PARENTHESIS . term RIGHT_PARENTHESIS

    LEFT_PARENTHESIS  shift, and go to state 1
    NAME              shift, and go to state 2

    formula  go to state 5
    term     go to state 6


State 2

    2 formula: NAME .
    4 term: NAME .

    RIGHT_PARENTHESIS  reduce using rule 2 (formula)
    RIGHT_PARENTHESIS  [reduce using rule 4 (term)]
    RELATION_INFIX     reduce using rule 4 (term)
    $default           reduce using rule 2 (formula)

So it seems confused by a RIGHT_PARENTHESIS. And the states seem to accept NAME RIGHT_PARENTHESIS, even though the grammar requires a LEFT_PARENTHESIS before.

Any ideas ?

EDIT

The grammar above is simplified to focus on the reduce/reduce conflict. My real intention is to parse first-order logic formulas : terms are combinations of operators and formulas are logic combinations (not, and, or, implies) of relations on terms.

Terms and formulas are sometimes long expressions, so we give them names to call them more easily. All names are eventually expanded to their trees of connectors and operators, later in the program (not in the bison parser).

Among the terminal symbols are the variables (the objects modeled by the logic) ; the formulas are statements about those objects. It is similar to types and functions in C++ : the same naming scheme can be used for all of them.

score 4 · Accepted Answer · edited Jun 20 '20 at 09:12

There is no ambiguity. That does not mean there is no parsing conflict.

An ambiguous grammar cannot be parsed with an LR(k) parser; attempting to generate the parsing automaton will necessarily fail because of the existence of two different derivations for the same sentence. But the reverse is not true. A grammar with a conflict may or may not be ambiguous. [Note 1]

To be parsed by an LR(k) grammar, every non-terminal needs to be resolved before more than k more symbols are read.

So, while you are quite correct that

LEFT_PARENTHESIS NAME RIGHT_PARENTHESIS must be a formula, because there is no RELATION_INFIX.

that does not tell the whole story. For LEFT_PARENTHESIS NAME RIGHT_PARENTHESIS to be a formula, NAME must also be a formula, and the reduction of NAME to formula must occur within 1 symbol (because k is always one in a bison LR parser). But only reading one symbol after the NAME, it is not possible to know that "there is no RELATION_INFIX". Hence the reduce-reduce conflict.

If I understand your note correctly, what you have is basically an expression grammar where there are two types of expressions ("formulas" and "terms") with different operators, where identifiers can be primitive variables or the name of an expression of either type. A restriction is that the operand of a "term" operator can only be a "term", while the operand of a "formula" operator can be either a "formula" or a "term". (That is not reflected in the simplified grammar, which does not allow term AND (term AND term) (with or without the parentheses); I'm just assuming that it is desired because otherwise it would be impossible for formulas to be complicated, but your note says that they often are.)

If a phrase includes a visible operator, the syntactic category ("formula" or "term") is evident, but a syntactic phrase consisting only of an identifier, possibly surrounded by redundant parentheses, can be either a "term" or a "formula". That makes it essentially impossible to completely express the restriction syntactically without knowing the syntactic category of each identifier.

If the syntactic category of every identifier is known before the first use of the identifier, then lexical feedback could be used to allow complete enforcement of the restriction during parsing. Otherwise, or if lexical feedback is considered undesirable, it will be necessary to validate any subexpression containing an identifier during a subsequent semantic analysis pass, perhaps the same one which expands macro identifiers.

Since a semantic check needs to be made anyway, it is easy to also check that a formula is never the operand to a term operator. In that case an ordinary expression grammar can be used, including normal use of precedence declarations to simplify the presentation. This grammar will directly produce the correct AST for any correct input; the semantic check is only required to reject incorrect inputs. Doing the detection in this way also helps produce informative error messages. There is really very little down-side to this solution, IMHO.

But it is also reasonably straight-forward to write a grammar which expresses the syntactic restriction for cases other than names. The trick is to use three expression types, since there really are three syntactic categories, as shown by the above discussion:

evident formulas
evident terms
phrases which might be either formulas or terms (i.e. possibly parenthesized names)

Getting parenthesized subexpressions right is a bit annoying and adds clutter to the grammar. To my mind, there is no real benefit. But if you want to try this, here's a simple example, using exactly one term operator and one formula operator and sample reduction actions which insert explicit check/conversion operations into the AST:

formula: formula_not_term
       | term_not_name     { $$ = mkform("TERM->FORM", $1, NULL); }
       | name              { $$ = mkform("NAME->FORM", $1, NULL); }
formula_not_term
       : formula FORMULA_OPERATOR formula
                           { $$ = mkform("FORM_OP", $1, $3); }
       | '(' formula_not_term ')'
                           { $$ = $2; }
term   : term_not_name
       | name              { $$ = mkterm("NAME->TERM", $1, NULL); }
term_not_name
       : term TERM_OPERATOR term
                           { $$ = mkterm("TERM_OP", $1, $3); }
       | '(' term_not_name ')'
                           { $$ = $2; }
name   : NAME
       | '(' name ')'      { $$ = $2; }

Notes

Determining whether a grammar ambiguous is undecidable, which means that no algorithm exists which will definitely answer the question for any grammar. If all ambiguous grammars produced LR(1) parsing conflicts, you could use the LR(1) parser construction to detect ambiguity, which would contradict the undecidability. So fnding unambiguous grammars which produce parsing conflicts is to be expected.

@v.semeria do not put essential information in comments. Edit the question. You should consider comments to be ephemeral; they could be deleted or moved at any moment. — rici, Jan 03 '18 at 16:02
@v.semeria that's basically what I was going to suggest, since you will need to type-check variables in your semantic processing anyway, unless you implement lexical feedback. Let me try to dig out a similar answer... — rici, Jan 03 '18 at 17:10
@v.semeria: perhaps this one is useful: https://stackoverflow.com/a/43308337/1566221 — rici, Jan 03 '18 at 17:13
The "crude" grammar produces an accurate parse for all correct inputs; it just does not detect all type errors. The semantic type checking is almost trivial, using a simple depth-first tree traversal. So why complicate the grammar? — rici, Jan 03 '18 at 20:11
@V.Semeria: Add above comment and an outline sketch of a grammar to the answer. — rici, Jan 04 '18 at 06:25
Thanks again for this solution. Your grammar is slighty too complicated : the operand of a formula operator must be a formula. Consider the set theory as a foundation of mathematics : terms are sets and formulas are statements abouts sets, like theorems. Term operators can be +, - or any binary mathematical function. Formula operators are logical connectives not, and, or, implies. In addition to term operators, there are term relations, like 2 \in Naturals. So `term TERM_RELATION term` is a formula. — V. Semeria, Jan 04 '18 at 14:27
@v.semeria: ok, that wasn't clear to me. But the general outline should handle your use case, if that's the route you want to take. — rici, Jan 04 '18 at 14:34

score 1 · Answer 2 · answered Jan 03 '18 at 13:28

1

Following your grammar LEFT_PARENTHESIS NAME RIGHT_PARENTHESIS may match either a formula or a term.

As it concerns reducing input in a particular state, it is not guaranteed to succeed (and will not in the case of a NAME RIGHT_PARENTHESIS expression). The state progression says that it is the only unambiguous direction possible. Failing doing so on an incorrect input is not a breaking point from the grammar perspective. But having several reduce alternatives definitely is.

answered Jan 03 '18 at 13:28

jszpilewski

1,632
1
21
19

The first rule of the grammar is formula, so shouldn't `LEFT_PARENTHESIS NAME RIGHT_PARENTHESIS` be parsed as a formula ? Do you have an example of a valid input for the grammar that has 2 possible parsing paths ? – V. Semeria Jan 03 '18 at 13:39
Besides, if you delete the 2 parentheses rules, the reduce/reduce conflict disappears. Even if NAME could be both term and formula. – V. Semeria Jan 03 '18 at 13:58
Bison is a LALR (bottom-up) parser so it first collects tokens and tries to match (reduce) them to available symbols. The order in which those symbols are defined in the grammar does not matter. – jszpilewski Jan 03 '18 at 14:00
I doubt that. The first grammar rule is special : it is the entry point of the parser. Try parsing something that matches the second rule in a simple grammar, bison will report a syntax error. – V. Semeria Jan 03 '18 at 14:12
Single NAME follows State 0 - State 2 - $default path as this is the final shift and formula is the starting symbol which can be identified with one token look ahead rule. The symbol term may be followed by RELATION_INFIX (on the left side) which is ok due to look ahead but may be the final one like formula (on the right side) which presents a conflict as no further hints are available. So having clause like "term RELATION_INFIX" probably could fix the grammar as well. – jszpilewski Jan 03 '18 at 14:51

reduce/reduce conflict in bison grammar

2 Answers2

Notes