1

Given input with repeating BLOCKs, where each block has repeating BEGIN EVENT and END EVENT entries (END EVENT always follows a BEGIN EVENT):

[TIMESTAMP] BLOCK
[TIMESTAMP] BEGIN EVENT
[TIMESTAMP] END EVENT
[TIMESTAMP] BEGIN EVENT
[TIMESTAMP] END EVENT
...
[TIMESTAMP] BLOCK

How do you disambiguate this grammar with LR(1)? I'm using LALRPOP, and the minimum example of this is:

Timestamp = "[TIMESTAMP]";
BlockHeader = Timestamp "BLOCK";
Begin = Timestamp "BEGIN" "EVENT";
End = Timestamp "END" "EVENT";

Block = BlockHeader (Begin End)+;
pub Blocks = Block*

Because LR(1) can only look one token ahead though, this grammar is ambigious, as LALRPOP helpfully tells you (partial error):

Local ambiguity detected

  The problem arises after having observed the following symbols in the input:
    BlockHeader (Begin End)+
  At that point, if the next token is a `"[TIMESTAMP]"`, then the parser can proceed in two different ways.

  First, the parser could execute the production at
  /home/<snip>.lalrpop:51:9: 51:32, which would consume
  the top 2 token(s) from the stack and produce a `Block`. This might then yield a parse tree like
    BlockHeader (Begin End)+ Block
    ├─Block────────────────┤     │
    ├─Block+───────────────┘     │
    └─Block+─────────────────────┘

  Alternatively, the parser could shift the `"[TIMESTAMP]"` token and later use it to construct a
  `Timestamp`. This might then yield a parse tree like
    (Begin End)+ "[TIMESTAMP]" "BEGIN" "EVENT" End
    │            ├─Timestamp─┘               │   │
    │            └─Begin─────────────────────┘   │
    └─(Begin End)+───────────────────────────────┘

I see that it is telling me, after parsing a BlockHeader, Begin and End it can't figure out if the next token is another Begin, or the start of another Block. I have not found a way to disambiguate this in LR(1), but I can only assume this is a lack of understanding on my part though, and not an inherit limitation of LR(1) grammars?

Alec Thilenius
  • 524
  • 5
  • 12
  • What about `BlockBody = (Begin End)+; Block = BlockBody BlockHeader; pub Blocks = BlockHeader Block+ BlockBody;`? – yyyy Feb 10 '19 at 05:12
  • Unfortunately it's still ambiguous, [error output](https://gist.github.com/AThilenius/f202e12f5bd5e10f8e6cde2e30ca6d1e). I'm starting to think it's not possible in one pass with an LR(1) grammar :,( – Alec Thilenius Feb 10 '19 at 17:36

1 Answers1

2

Unfortunately this kind of 'need more lookahead' problem is hard to solve without completely restructuring the grammar, which often loses desirable structure of the input, and sometime accepts degenerate inputs the original grammar would reject. You can usually reject those inputs and get that structure back by post-processing the parse tree, but that is more work. In your case, the grammar:

Timestamp = "[TIMESTAMP]";
BlockHeader = Timestamp "BLOCK";
Begin = Timestamp "BEGIN" "EVENT";
End = Timestamp "END" "EVENT";
Event = Begin End;
Item = BlockHeader | Event;
pub Input = Item*

should do the trick, but has the problem that it loses the block structure (instead giving you an unstructured sequence of block headers and events), and it accepts empty blocks. You can readily deal with both problems by post-processing the list of items.

An alternative option when the required lookahead is small and bounded is to deal with it in your tokenizer. I'm not familiar with LALRPOP, but it should be possible to "combine" the [TIMESTAMP] tokens with the immediately following keyword tokens (so the timestamps would not be present in the grammar, instead just being an attribute on the keywords), in which case everything would work fine with single token lookahead.

Chris Dodd
  • 119,907
  • 13
  • 134
  • 226
  • A detailed and thoughtful answer, thank you Chris! I started looking into combining tokens as well, but still want to parse the actual timestamp [YYYY-MM-DD HH:MM] in the same pass. If my (very murky but solidifying) understanding of parsers is correct, than this is simply not possible in LR(1) because it by definition will only ever look 1 token ahead, and 1 token ahead will always be ambiguous. I played with some PEG grammars as well, which I like a lot more but the libraries for them aren't as slick as LALRPOP, which combines AST generation into parsing. – Alec Thilenius Feb 10 '19 at 17:30
  • If you need to parse the timestamp as well, that's probably too much lookahead to do easily in the lexer, so you're probably best off with the approach above, parsing the input into a simple list of items and then post-processing that into blocks. Depending on your parser generator that may be doable on-the-fly (parsing into a queue of items and processing that queue as you go, just needing some limited lookahead in the queue). You might even be able to structure it as two-level parser (one parser that generates a sequence of items that are considered as tokens to a second parser.) – Chris Dodd Feb 10 '19 at 19:16