Are C# and Java Grammars LALR(x)?

Question

I wonder if C# and Java grammars are LALR(x)? If yes, what's the value of x?

Edit:

After accepting the true answer, I think it is better to change the Q in this way:

Is there any LALR(x) parser that could parse current releases of Java (version 7) or C# (version 4)? If yes, what is the value of x?

I see several suggestions to close this question. I cannot understand the reasoning; the question is quite clear. — Ira Baxter, Dec 05 '11 at 00:55

Ira Baxter · Accepted Answer · 2011-12-10T20:15:54.587

You can't ask this question without first designating a specific grammar for a langauge, as some grammars may be, and some may not.

Perhaps you mean the Java grammar as published in recent Java specifications. Do you mean for Java 7?

I'm not sure you can designate a specific grammar for C#, at least not one from Microsoft, especially for C# 4.0; I don't believe they have published a grammar.

I can tell you that i don't think C# can be LALR(x), because it has some elements which look like identifiers, but can be keywords in certain contexts. This requires the lexer to know what the parser is expecting to decide if an identifier-like token is a keyword, or just and identifier. Thus there has to be feedback from the parser to lexer, or the lexer has to produce both tokens and pass them to the parser to decide which it wants. LALR parsers are defined on token streams without any feedback, and where every input token has only one interpretation.

I don't think Java is, either, from Java 1.5 and up, when enum was introduced as a special type with its own keyword. This is because, for Java 1.5 compilers to process existing Java 1.4 programs that used enum as a variable name, enum must be treated as a keyword in some contexts, and as a variable name in others. So a Java 1.5 parser has the same issues as C# does.

As a practical matter, no real langauges are LALR(1) [first edition Java may be an exception] and anybody building a real parser (esp LALR) has to make some kind of hack to get around this. (GCC famously parsed C++ with an LALR parser with an awful symbol table hack for a long time, so it could tell the difference between an identifier as a variable, and an identifier as a typedef instance. It now has some kind of hand-implemented recursive descent parser, but I think the awful hack remains). So I'm not sure the value of answer to your question.

Our C# 4.0 and Java 7 members of our family of language front ends both parse the languages using a GLR parser, extended both with the feedback capability, and the ability to process two interpretations of the same token. GLR makes the question of LALR(x) moot, and the feedback and multiple interpretations let us handle many languages that would be outside of pure GLR's capability, too.

EDIT: After a bit of thought, there might be a truly ugly way to make both grammars handle their keyword-in-context. Let's use Java's enum as an example. There realistically has to be grammar rule:

  type = 'enum' '{'  enum_members '}' ;

But we also need to allow 'enum' as an identifer. We can do that, by replacing the terminal token identifier with a nonterminal:

  identifier = IDENTIFIER | 'enum' ;

and insist that IDENTIFIERs are the terminals produced by the lexer. Now at least the lexer does not have to decide how to treat enum; the parser does. But your designated grammar would have to shaped like this in order to even have a chance of being LALR(x).

Our parsers used to do this to allow some keywords to be used sometimes as identifiers. We changed our parsing engine as described earlier, and don't do this any more.

Thanks Ira for your fruitful answer. Referring to your enum example, now I am sure that the Java 7 and c# 4 languages cannot be parsed by LALR(x) parsers. — TonySalimi, Dec 05 '11 at 09:22
@hsalimi: The right interpretation is that you can't parse them with a pure LALR(x) parser. People make working parsers by taking whatevever parsing technology they have, building a grammar that respects the limitations of the parsing technology (having basically no choice about this), and then hacking something in the lexer/parser to make it work. — Ira Baxter, Dec 05 '11 at 10:06
Thanks again Ira for your answer and comments. In addition I also believe that the truth is in the details. :-) — TonySalimi, Dec 05 '11 at 14:23
As a note regarding `enum`, a Java compiler running at compliance level 1.5 or up *doesn't* permit `enum` to be used as an identifier, and a substantial number of libraries had to be rewritten after the 1.5 introduction. Instead, the compiler is selecting between different grammars. — chrylis -cautiouslyoptimistic-, Jan 31 '14 at 02:42
@IraBaxter how does the link to your company's product enhance this answer? — xaxxon, Oct 06 '16 at 05:18
@xaxxon: OP explicitly asks if there are LALR(x) grammars for Java and C#. The link provides explicit proof of same. — Ira Baxter, Oct 06 '16 at 11:07
The C# 5.0 spec is now published at https://www.microsoft.com/en-us/download/confirmation.aspx?id=7029 with grammar at end as noted at https://stackoverflow.com/questions/23456868/c-sharp-5-0-ebnf-grammar#comment35960329_23456868. I have put it in plain text & created a manual parser that converts it to a dict (so I may eventually get it converted to lark) at https://github.com/poikilos/pycodetool. Someone mentioned somewhere that the C# spec is a mixture of BNF & something else & therefore somehow close to EBNF so if anyone knows please clarify that & whether that helps with the question. — Poikilos, Aug 05 '22 at 06:15

score 14 · Answer 2 · edited Dec 19 '11 at 18:14

14

The Java grammar (version 1.0) is known to be LALR(1); this site provides a grammar and begins with the notice that

The grammar has been mechanically checked to insure that it is LALR(1).

I am not sure whether C# is LALR(1), but there is a C# parser written in bison available here, which suggests that it's probably LALR(1) (assuming that you allow for precedence declarations).

For what it's worth, typically LALR(1) is the only LALR parser used. If you need to use something like LALR(2) for a grammar, it is typically a better idea to use a LALR(1) parser with explicit precedence disambiguation, or a more powerful parser like a GLR parser.

Hope this helps!

edited Dec 19 '11 at 18:14

TonySalimi

8,257
4
33
62

answered Dec 04 '11 at 21:03

templatetypedef

362,284
104
897
1,065

2

Thanks, but I think the referred document belongs to the early version of Java. Are you sure that the new java language spec (including generics, etc) is still LALR(1)? – TonySalimi Dec 04 '11 at 21:10
There is a grammar for C# added to C# spec. – Lukasz Madon Dec 05 '11 at 01:45
1

"The C# spec"? You mean the ECMA spec? that isn't C# 4.0, and MS didn't implement that spec anyway. – Ira Baxter Dec 05 '11 at 10:07

score 5 · Answer 3 · edited Dec 19 '11 at 18:08

5

At least for Java (Version 1.0) it is: http://java.sun.com/docs/books/jls/first_edition/html/19.doc.html

edited Dec 19 '11 at 18:08

jamessan

41,569
8
85
85

answered Dec 04 '11 at 21:01

dbf

6,399
2
38
65

Are C# and Java Grammars LALR(x)?

3 Answers3

Linked

Related