Recovering error tokens in parsing (Lemon)

Question

I'm using Lemon as a parser generator. If you don't know Lemon, its error handling is the same as yacc's and bison's.

Lemon has an option to define the error token in a set of rules in order to catch parsing errors. The default behavior of the generated parser is to destroy the token causing the error; is there any way to override this behavior so that I can keep the token?

Here's an example to show what's happening. I'm appending the tokens of each rule together to re-form the input string, using this grammar:

input ::= string(A). { printf("%s", A); } // Print the result
string(A) ::= string(B) part(C). { A = append(B, C); }
string(A) ::= part(B). { A = B; }
part(A) ::= NUMBER(B) NAME(C). { A = append(C, B); } // Rearrange the number and name
part(A) ::= error(B). { A = B; } // On error keep the token anyways

On input:

"Username 1234Joseph"

I get output:

"Joseph1234"

The text "Username " is discarded by the parser in the part(A) ::= error(B) rule, but I really want:

"Username Joseph1234"

as output.

If you can solve this problem in bison or another parser generator I would accept that as an answer :)

score 2 · Accepted Answer · answered Jun 06 '12 at 17:57

With yacc/bison, a parsing error drops the tool into error recovery mode, if possible. It will attempt to discard tokens on its way to a "clean" state.

I'm unable to find a reference for Lemon, so I can't show Lemon code to fix this, but with yacc/bison, one would use the rules here.

Namely, you need to adjust your error rule to call yyerrok, which tells the parser that it has recovered and prevents it from dropping tokens. The parser will then attempt to reread the "bad" token, so you need to clear it with yyclearin. Finally, since the rule attached to your error code contains the contents of your token, you will need to set up a function that adjusts your input stack by taking the current token contents and creating a new (proper) token with the same contents.

As an example, if a grammar defined as MyOther MyOther saw MyTok MyOther, the stack would be rewritten from the first form to the second:

stack (before)
MyTok: "the text"
MyOther: "new text"

stack (after)
MyOther: "the text"
MyOther: "new text"

To accomplish this, look into using the YYBACKUP macro. It is frowned upon, but I'm unable to find an alternative method.

In case anyone is interested, I ended up switching over to RE2C (http://re2c.org/). It was much easier to get this behaviour with RE2C macros. I wrote the macros so that any unmatched substrings are simply output while any matched substrings are modified by RE2C. – Sadly Not, May 05 '15 at 17:29

score 2 · Answer 2 · answered May 27 '11 at 20:25

It's an old one, but why not...

The grammar must include spaces. At the moment the grammar only allows a sequence of NUMBER NAME tokens (without any space between the tokens).

Omri Barel

There are badges (Necromancer and Revival) for answering old questions and getting up votes, so there's every reason to answer older questions without an answer (or without a good answer). – Jonathan Leffler May 28 '11 at 00:07
The lexical analyzer presumably deals with spaces between tokens, etc. That is a standard division of labour - the lexical analyzer handles comments and blanks and strings; the grammar deals with the tokens found by the lexical analyzer that are not eaten by it. – Jonathan Leffler May 28 '11 at 00:09
@Jonathan Leffler, I couldn't make that assumption based on the question. The token sequence NUMBER NAME is expected to catch 1234Joseph, but usually that would not be the case (1234Joseph would not be a legal token). I hope you see what I mean with respect to spaces. – Omri Barel May 28 '11 at 18:49
Fair enough - without knowing how the lexical analyzer analyzes stuff, it is hard to be sure what is going on. – Jonathan Leffler May 28 '11 at 18:52