2

I need to parse a simple language that I didn't design, so I can't change the language. I need the results in C#, so I've been using TinyPG because it's so easy to use, and doesn't require external libraries to run the parser.

Things had been going pretty well, until I ran into this construct in the language. (This is a simplified version, but it does show the problem):

EOF               -> @"^\s*$";
[Skip] WHITESPACE -> @"\s+";
LIST              -> "LIST";
END               -> "END";
IDENTIFIER        -> @"[a-zA-Z_][a-zA-Z0-9_]*";
Expr              -> LIST IDENTIFIER+ END;
Start             -> (Expr)+ EOF;

The resulting parser cannot parse this:

LIST foo BAR Baz END

because it greedily lexes END as an IDENTIFIER, instead of properly as the END keyword.

So, Here are my questions:

  1. Is this grammar ambiguous or wrong for LL(1) parsing? Or is this a bug in TinyPG?

  2. Is there any way to redesign the grammar such that TinyPG will properly parse the example line?

  3. Are there any other suggestions for a simple parser that outputs code in C# and doesn't require additional libraries? I've looked at LLLPG and ANTLR4, but found them much more troublesome than TinyPG.

Nazmul
  • 575
  • 3
  • 18
Jim
  • 651
  • 2
  • 7
  • 15
  • Just as a thought, are you sure that it's `IDENTIFIER` that is at fault of gobbling up `END`, and not `EOF`? In what way doesn't it parse? – David Arno May 22 '15 at 08:17
  • Yes, the error from the parser is: (1,0): Unexpected token 'EOF' found. Expected END. And the Parse Tree is lists: IDENTIFIER 'END' as the last thing successfully parsed. – Jim May 22 '15 at 08:47
  • 1
    Does TinyPG support look ahead/ look behind? If so, does changing `IDENTIFIER` to `(?!END\b)\b\w+` fix the problem? – David Arno May 22 '15 at 09:02
  • Huh, that actually worked. It's not really TingPG that supports that, but the C# regex library it uses, but this worked: @"(?!END)[a-zA-Z_][a-zA-Z0-9_]*"; I still think this is a bug in in TinyPG (obviously you wouldn't want to have to match all possible keywords this way) but for my particular case, where the only possibly confusion is END, I think this will actually work. Thanks! – Jim May 22 '15 at 09:18
  • I agree, it definitely seems a bug in TinyPG as the solution smells too like a hack to me. There is a fork on GitHub (https://github.com/SickheadGames/TinyPG), but that hasn't been updated for a year as well. I was going to say it might be worth raising an issue there, but assuming you are "jrleek", it looks like you did so already :) – David Arno May 22 '15 at 09:34
  • Yeah, that was me. I've actually identified the problem area in the generated code, but I'm not 100% sure what the solution is in general. As I said, I'm pretty rusty on this stuff. I think it's either a lexer design bug, or a Reduce/Reduce conflict. In Yacc I think reduce/reduce conflicts are solved by the ordering of the rules, which I think TinyPG doesn't pay much attention to. – Jim May 22 '15 at 13:13

1 Answers1

0

You might be the same guy since the issue looks identical, as the one I answered on GitHub, but here it is again for people who google this issue.

Here is an example from the Simple-CIL-compiler project, The identifier has to catch single words except the ones listed, which means you have to include the exception token's in to the identifier

IDENTIFIER-> @"[a-zA-Z_][a-zA-Z0-9_]*(?<!(^)(end|else|do|while|for|true|false|return|to|incby|global|or|and|not|write|readnum|readstr|call))(?!\w)";

Hope that helps.

(Link to Original post)