How to overcome shift-reduce conflict in LALR grammar

Question

I am trying to parse positive and negative decimals.

number(N) ::= pnumber(N1).

number(N) ::= nnumber(N1).

number(N) ::= pnumber(N1) DOT pnumber(N2).

number(N) ::= nnumber(N1) DOT pnumber(N2).

pnumber(N) ::= NUMBER(N1).

nnumber(N) ::= MINUS NUMBER(N1).

The inclusion of the first two rules gives a shift/reduce conflict but I don't know how I can write the grammar such that the conflict never occurs. I am using the Lemon parser.

Edit: conflicts from .out file

State 79:
     (56) number ::= nnumber *
          number ::= nnumber * DOT pnumber

                           DOT shift        39
                           DOT reduce       56      ** Parsing conflict **
                     {default} reduce       56     number ::= nnumber

State 80:
     (55) number ::= pnumber *
          number ::= pnumber * DOT pnumber

                           DOT shift        40
                           DOT reduce       55      ** Parsing conflict **
                     {default} reduce       55     number ::= pnumber
State 39:
          number ::= nnumber DOT * pnumber
          pnumber ::= * NUMBER

                        NUMBER shift-reduce 59     pnumber ::= NUMBER
                       pnumber shift-reduce 58     number ::= nnumber DOT pnumber

State 40:
          number ::= pnumber DOT * pnumber
          pnumber ::= * NUMBER

                        NUMBER shift-reduce 59     pnumber ::= NUMBER
                       pnumber shift-reduce 57     number ::= pnumber DOT pnumber

Edit 2: Minimal grammar that causes issue

start ::= prog.
prog ::= rule.
rule ::= REVERSE_IMPLICATION body DOT.
body ::= bodydef.
body ::= body CONJUNCTION bodydef.
bodydef ::= literal.
literal ::= variable.
variable ::= number.
number ::= pnumber.
number ::= nnumber.
number ::= pnumber DOT pnumber.
number ::= nnumber DOT pnumber.
pnumber ::= NUMBER.
nnumber ::= MINUS NUMBER.

Please produce an MCVE ([MCVE]) for it. With the `(N)` labels removed (they're not used), the fragment given does not produce a shift reduce conflict with the Lemon from SQLite 3.14.0. At minimum, you should look in the `.out` file and include the relevant conflict information in the question, but it would be best to produce a grammar fragment that can be copied and that reproduces the problem. Then we can help you (probably). Until then, we're stuck. — Jonathan Leffler, Aug 16 '16 at 23:53
@JonathanLeffler I added the output of .out file where the conflicts occur. The (N) labels are actually used but I removed them from the question for brevity. — Pdksock, Aug 17 '16 at 00:22
Of course the labels are used in your real grammar — but they're a nuisance when trying to help you. You need to work on the MCVE-ization of your code. The shifts to 39 and 40 are relevant — but we don't know what rules those are a part of. The trouble may be that your grammar needs to lookahead two tokens — what does it get after the DOT. If that is the problem, you most likely need to enhance your lexical analyzer so it, rather than the grammar, handles decimals versus integers, returning different tokens (e.g. NUMBER as now and DECIMAL). Note that leading zeros are significant after DOT. — Jonathan Leffler, Aug 17 '16 at 00:28
@JonathanLeffler I have added a fraction of grammar which causes this issue. The first line has DOT at the end and that is why I think the shif/reduce error is occuring. — Pdksock, Aug 17 '16 at 00:37
Hmmm…the fraction/fragment has no dots at the ends of lines, and the `*` causes confusion to my Lemon. When those are fixed (`.` added, `*` removed), Lemon complains about no rule for `pnumber` or `nnumber`. When I add the last two lines of the original grammar fragment (sans labels), then it compiles without a S/R conflict. I'm not seeing a sequence that leads to problems, though I agree that it is likely that the `rule ::= body DOT.` rule is part of the trouble. — Jonathan Leffler, Aug 17 '16 at 00:44
@JonathanLeffler This definitely won't run in lemon. It is a part of the .out file from which I took out the relevant info. `pnumber` and `nnumber` are `NUMBER` — Pdksock, Aug 17 '16 at 00:48
You need to be able to copy/paste what you post as a grammar fragment into a file `so-3898-5824.y` and then run `lemon so-3898-5824.y` and get the S/R conflict — that's what an MCVE means. Without that, we're flying blind. The only gotcha is the M part of MCVE — what you post should be minimal. Note that the `-g` option may help; it removes the actions from the grammar, leaving just the rules. — Jonathan Leffler, Aug 17 '16 at 00:48
@JonathanLeffler I have added minimal grammar as edit 2 which will cause conflict on running with lemon. — Pdksock, Aug 17 '16 at 01:10

score 3 · Accepted Answer · edited Feb 10 '19 at 05:26

The conflicts you show indicate a problem with how the number non-terminal is used, not with number itself.

The basic problem is that after seeing a pnumber or nnumber, when the next token of lookahead is a DOT, it can't decide if that should be the end of the number (reduce, so DOT is part of some other non-terminal after the number), or if the DOT should be treated as part of the number (shifted so it can later reduce one of the p/nnumber DOT pnumber rules.)

So in order to diagnose the problem, you'll need to show all the rules that use number anywhere on the right hand side (and recursively any other rules that use any of those rules' non-terminals on the right).

Note that it is rarely useful to post just a fragment of a grammar, as the LR parser construction process depends heavily on the context of where the rules are used elsewhere in the grammar...

So the problem here is that you need two-token lookahead to differentiate between a DOT in a (real) number literal and a DOT at the end of a rule.

The easy fix is to let the lexer deal with it -- lexers can do small amounts of lookahead quite easily, so you can recognize REAL_NUMBER as a distinct non-terminal from NUMBER (probably still without the -, so you'd end up with

number ::= NUMBER | MINUS NUMBER | REAL_NUMBER | MINUS REAL_NUMBER

It's much harder to remove the conflict by factoring the grammar but it can be done.

In general, to refactor a grammar to remove a lookahead conflict, you need to figure out the rules that manifest the conflict (rule and number here) and refactor things to bring them together into rules that have common prefixes until you get far enough along to disambiguate.

First, I'm going to assume there are other rules besides number that can appear here, as otherwise we could just eliminate all the intervening rules.

variable ::= number | name

We want to move the number rule "up" in the grammar to get it into the same place as rule with DOT. So we need to split the containing rules to special case when they end with a number. We add a suffix to denote the rules that correspond to the original rule with all versions that end in a number split off

variable ::= number | variable_n
variable_n ::= name

...and propagate that "up"

literal ::= number | literal_n
literal_n ::= variable_n

...and again

bodydef ::= number | bodydef_n
bodydef_n := literal_n

...and again

body ::= number | body_n
body := body CONJUNCTION number
body_n ::= bodydef_n
body_n ::= body CONJUNCTION bodydef_n

Notice that as you move it up, you need to split up more and more rules, so this process can blow up the grammar quite a bit. However, rules that are used only at the end of a rhs that you're refactoring will end up only needing the _n version, so you don't necessarily have to double the number of rules.

...last step

rule ::= REVERSE_IMPLICATION body_n DOT
rule ::= REVERSE_IMPLICATION number DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION number DOT

Now you have the DOTs in all the same places, so expand the number rules:

rule ::= REVERSE_IMPLICATION body_n DOT
rule ::= REVERSE_IMPLICATION integer DOT
rule ::= REVERSE_IMPLICATION integer DOT pnumber DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION integer DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION integer DOT pnumber DOT

and the shift-reduce conflicts are gone, because the rules have common prefixes up until past the needed lookahead to determine which to use. I've reduced the number of rules in this final expansion by adding

integer ::= pnumber | nnumber

What I meant was with input `DOT` it wouldn't know whether to reduce `rule` or use it as a part of `number` — Pdksock, Aug 17 '16 at 00:50
Depends on whether it is an ambiguity problem or a lookahead problem. For a lookahead problem (which this superficially looks to be), you need to refactor the grammar to defer the decision as to which way to go. — Chris Dodd, Aug 17 '16 at 00:56
In fragment you posted as Edit 2, there should be no problem, as `body` cannot end in a `number` (must end with an `RBRACKET`), and a `number` can only be followed by a `COMMA` or `RBACKET`. — Chris Dodd, Aug 17 '16 at 01:00
Yeah that example was wrong actually. I have added a minimal grammar which is causing conflicts as edit 2 — Pdksock, Aug 17 '16 at 01:11
One other advantage of resolving real numbers vs integers in the lexical analyzer is that typically white space is ignored between full tokens, so the RHS of rules such as `pnumber DOT pnumber` are apt to allow spaces before and after the dot, whereas the lexical analyzer can usually reject `123 . 456` (but accept `123.456`). — Jonathan Leffler, Feb 10 '19 at 05:32

Kaz · Answer 2 · 2016-08-19T00:35:54.940

You have to declare the associativity of the DOT operator token with %left or %right.

Or, another idea is to drop this intermediate reduction. The obvious feature in your grammar is that numbers grow by DOT followed by a number. That can be captured with a single rule:

number : number DOT NUMBER

A number followed by a DOT followed by a NUMBER token is still a number.

This rule doesn't require DOT to have an associativity declared, because there is no ambiguity; the rule is purely left-recursive, and the right hand of DOT is a terminal token. The parser must reduce the top of the stack to number when the state machine is at this point, and then shift DOT:

number : number DOT NUMBER

The language which you are parsing here is regular; it can be parsed by regular expressions without any recursion. That is why rules that have both left and right recursion in them and require associativity to be declared are somewhat of a "big hammer".

Currently I am parsing `number DOT NUMBER` with regular expression as a part of my solution. However I will give `DOT` left associativity and see if that works. — Pdksock, Aug 19 '16 at 20:26

How to overcome shift-reduce conflict in LALR grammar

2 Answers2