TinyPG doesn't properly parse this grammar, bug or bad grammar?

Question

I need to parse a simple language that I didn't design, so I can't change the language. I need the results in C#, so I've been using TinyPG because it's so easy to use, and doesn't require external libraries to run the parser.

Things had been going pretty well, until I ran into this construct in the language. (This is a simplified version, but it does show the problem):

EOF               -> @"^\s*$";
[Skip] WHITESPACE -> @"\s+";
LIST              -> "LIST";
END               -> "END";
IDENTIFIER        -> @"[a-zA-Z_][a-zA-Z0-9_]*";
Expr              -> LIST IDENTIFIER+ END;
Start             -> (Expr)+ EOF;

The resulting parser cannot parse this:

LIST foo BAR Baz END

because it greedily lexes END as an IDENTIFIER, instead of properly as the END keyword.

So, Here are my questions:

Is this grammar ambiguous or wrong for LL(1) parsing? Or is this a bug in TinyPG?
Is there any way to redesign the grammar such that TinyPG will properly parse the example line?
Are there any other suggestions for a simple parser that outputs code in C# and doesn't require additional libraries? I've looked at LLLPG and ANTLR4, but found them much more troublesome than TinyPG.

Just as a thought, are you sure that it's `IDENTIFIER` that is at fault of gobbling up `END`, and not `EOF`? In what way doesn't it parse? — David Arno, May 22 '15 at 08:17
Yes, the error from the parser is: (1,0): Unexpected token 'EOF' found. Expected END. And the Parse Tree is lists: IDENTIFIER 'END' as the last thing successfully parsed. — Jim, May 22 '15 at 08:47
Does TinyPG support look ahead/ look behind? If so, does changing `IDENTIFIER` to `(?!END\b)\b\w+` fix the problem? — David Arno, May 22 '15 at 09:02
Huh, that actually worked. It's not really TingPG that supports that, but the C# regex library it uses, but this worked: @"(?!END)[a-zA-Z_][a-zA-Z0-9_]*"; I still think this is a bug in in TinyPG (obviously you wouldn't want to have to match all possible keywords this way) but for my particular case, where the only possibly confusion is END, I think this will actually work. Thanks! — Jim, May 22 '15 at 09:18
I agree, it definitely seems a bug in TinyPG as the solution smells too like a hack to me. There is a fork on GitHub (https://github.com/SickheadGames/TinyPG), but that hasn't been updated for a year as well. I was going to say it might be worth raising an issue there, but assuming you are "jrleek", it looks like you did so already :) — David Arno, May 22 '15 at 09:34
Yeah, that was me. I've actually identified the problem area in the generated code, but I'm not 100% sure what the solution is in general. As I said, I'm pretty rusty on this stuff. I think it's either a lexer design bug, or a Reduce/Reduce conflict. In Yacc I think reduce/reduce conflicts are solved by the ordering of the rules, which I think TinyPG doesn't pay much attention to. — Jim, May 22 '15 at 13:13

score 0 · Answer 1 · answered Aug 16 '16 at 15:39

You might be the same guy since the issue looks identical, as the one I answered on GitHub, but here it is again for people who google this issue.

Here is an example from the Simple-CIL-compiler project, The identifier has to catch single words except the ones listed, which means you have to include the exception token's in to the identifier

IDENTIFIER-> @"[a-zA-Z_][a-zA-Z0-9_]*(?<!(^)(end|else|do|while|for|true|false|return|to|incby|global|or|and|not|write|readnum|readstr|call))(?!\w)";

Hope that helps.

(Link to Original post)

TinyPG doesn't properly parse this grammar, bug or bad grammar?

1 Answers1