I'm parsing a language that doesn't have statement terminators like ;
. Expressions are defined as the longest sequence of tokens, so 5-5
has to be parsed as a subtraction, not as two statements (literal 5
followed by a unary negated -5
).
I'm using LALRPOP as the parser generator (despite the name, it is LR(1) instead of LALR, afaik). LALRPOP doesn't have precedence attributes and doesn't prefer shift over reduce by default like yacc would do. I think I understand how regular operator precedence is encoded in an LR grammar by building a "chain" of rules, but I don't know how to apply that to this issue.
The expected parses would be (individual statements in brackets):
"5 - 5" → 5-5 instead of 5, -5
"5 (- 5)" → 5, -5
"- 5" → -5
"5 5" → 5, 5
How do I change the grammar such that it always prefers the longer parse?
Going through the first few pages of google results as well as stack overflow didn't yield any results for this specific problem. Most related questions need more lookahead or the result is to not allow consecutive statements without terminators.
I created a minimal sample grammar that reproduces the shift/reduce conflict (a statement in this grammar is just an expression, in the full grammar there would also be "if", "while", etc. and more levels of operator precedence, but I've omitted them for brevity). Besides unary minus, there are also other conflicts in the original grammar like print(5)
, which could be parsed as the identifier print
and a parenthesized number (5)
or a function call. There might be more conflicts like this, but all of them have the same underlying issue, that the longer sequence should be preferred, but both are currently valid, though only the first should be.
For convenience, I created a repo (checkout and cargo run
). The grammar is:
use std::str::FromStr;
grammar;
match {
"+",
"-",
"(",
")",
r"[0-9]+",
// Skip whitespace
r"\s*" => { },
}
Expr: i32 = {
<l:Expr> "+" <r:Unary> => l + r,
<l:Expr> "-" <r:Unary> => l - r,
Unary,
};
Unary: i32 = {
"-" <r:Unary> => -r,
Term,
}
Term: i32 = {
Num,
"(" <Expr> ")",
};
Num: i32 = {
r"[0-9]+" => i32::from_str(<>).unwrap(),
};
Stmt: i32 = {
Expr
};
pub Stmts: Vec<i32> = {
Stmt*
};
Part of the error (full error message):
/lalrpop-shift-repro/src/test.lalrpop:37:5: 37:8: Local ambiguity detected
The problem arises after having observed the following symbols in the input:
Stmt+ Expr
At that point, if the next token is a `"-"`, then the parser can proceed in two different ways.
First, the parser could execute the production at
/lalrpop-shift-repro/src/test.lalrpop:37:5: 37:8, which would consume
the top 1 token(s) from the stack and produce a `Stmt`. This might then yield a parse tree like
Expr ╷ Stmt
├─Stmt──┤ │
├─Stmt+─┘ │
└─Stmt+──────┘
Alternatively, the parser could shift the `"-"` token and later use it to construct a `Expr`. This might
then yield a parse tree like
Stmt+ Expr "-" Unary
│ ├─Expr───────┤
│ └─Stmt───────┤
└─Stmt+────────────┘
See the LALRPOP manual for advice on making your grammar LR(1).