Does Java has ambiguous syntax which needs more information about an identifier?

Question

NOTICE: This question is not about "Java do not have pointers"

In C language, the code identifier1 * identifier2 is ambiguous for two possible meaning:

If the identifier1 is a type, then this might be a pointer declaration.
If the identifier1 is a variable, then this might be a multiply statement.

The problem is that I cannot choose the right production when building the Syntax tree. I checked Clang's code and it seems that Clang has to put the type checking(by using a symbol table) to the parsing phase(correct me if I'm wrong).

Then I checked the code of javac(OpenJDK), it seems that on parsing phase, there's no semantic analysis involved. The parser can build an AST barely using the tokens.

So I'm curious if Java has the same ambiguous syntax problem? The problem that if the parser don't know an identifier's type, it can not choose the right production?

Or more generic, Does Java has syntax ambiguous that a parser cannot choose a production without other information more than a token stream?

I don't quite understand the question : java doesn't have pointers, so there can't be an ambiguity here, since `*` is always multiplication. — Sander De Dycker, Apr 08 '19 at 07:03
@SanderDeDycker I think the OP is talking in general, not just about `*`. In other words, are there _any_ symbols that can cause ambiguity while parsing the source that can only be solved by knowing the types used in the context. — Slaw, Apr 08 '19 at 07:31
Some operators are overloaded and may briefly confuse a programmer, e.g., `var1 + var2` might be *addition* if `var1 = 1` and `var2 = 2` or it might be *concatenation* if `var1 = "a"` and `var2 = "b"`. In mixed case - `var1 = "a"` and `var2 = 2` the result is a string. However, the result of the `+` operator is based on the types involved and these are known at compile time, so there is no ambiguity. In the case of objects `Long + Long` produces a `long`. But `Long + null` will not compile unless you specify if it should be `Long` or `String` — VLAZ, Apr 08 '19 at 07:32
@Lundin `+` means two different things. I was taught that this is because the operator is overloaded. Not by a programmer, but it still does multiple operations. Is that incorrect? — VLAZ, Apr 08 '19 at 07:34
@Lundin Java may not support user-defined operator overloading but `+` is overloaded since it can mean addition or string concatenation, depending on the types involved. — Slaw, Apr 08 '19 at 07:34
If you mean string concatenation that's not really overloading, just another operator with the same symbol. — Lundin, Apr 08 '19 at 07:36
@Lundin Is that not exactly what operator overloading is? Besides, string concatenation can be considered a kind of addition. — Slaw, Apr 08 '19 at 07:37
@Slaw No, operator overloading means providing a user-defined function that gets called when a matching operator appears in the code. Otherwise C would have "operator overloading" too, all over the place. — Lundin, Apr 08 '19 at 07:39
@SanderDeDycker I'm talking about the general, which is that parser need to get imformation more than just identifier values to choose a production. — , Apr 08 '19 at 07:39
@Lundin Hmm, in the case of `+` in Java I might disagree. Implementation wise, it ends up creating a `StringBuilder`, appending input, and creating the final `String`. So in a sense it _is_ calling a user-defined function (where the user is the JDK developers). Though this only happens when the arguments in the string concatenation aren't all string literals; otherwise, I believe the compiler computes the result ahead of time. — Slaw, Apr 08 '19 at 07:45
Overloading is defined as "specific case of polymorphism, where different operators have different implementations depending on their arguments." (wikipedia). So yes, + in java is overloading. But otoh, one could reason that in C it is overloaded, too, since adding two integers is differently implemented than adding two floats. So there is plenty of interpretation room here. — Ctx, Apr 08 '19 at 07:47
@Ctx I'd agree with that interpretation. If we are to think of it as functions `(long, long) -> long` and `(int, int) -> int` are surely different ones even if they are called the same. Similarly `organization.add(Department dep)` and `ogranization.add(Person boardDirector)` are also overloaded. — VLAZ, Apr 08 '19 at 07:55

score 1 · Answer 1 · answered Apr 08 '19 at 07:20

1

I don't think so Java has this problem as Java is strongly typed. Also, Java does not support Pointers so there is no chance of the above issue. I hope this answer your question.

answered Apr 08 '19 at 07:20

Rohail Usman

119
3
12

This isn't about strong typing. This is about ambiguous grammar. Moreover, the ambiguity isn't limited to C's pointer syntax. – Sapphire_Brick Jan 19 '21 at 04:18

TreffnonX · Accepted Answer · 2019-04-08T08:02:44.107

1

Tokenization is always context sensitive, for languages. However Java does not have operators that are this sensitive. You can, however chain tokens in such a way, that it produces ambiguity, but not only as part of a larger syntactical statement:

A < B can be part of both public class A < B > { ... } or if (A < B) { ... }. The first is a generic class definition, the second is a comparison.

This is just the first example from the top of my hat, but I presume there are more. However, the operators are usually very narrowly defined, and cannot (as in C/C++-like languages) be overloaded. Also, other than in C/C++ there is only one accessor-operator (the dot: .), with one exception (since Java 8, the double-colon ::). In C++ there are a bunch, so it is much less chaotic.

To the specific question about whether Java is always syntactically decidable: Yes. A well-implemented compiler can always decide what token is present, depending on a token stream.

edited Apr 08 '19 at 08:02

answered Apr 08 '19 at 07:33

TreffnonX

2,924
15
23

In the template example, If I look ahead further, then I can checkout if this is a template declare or a compare statement, right? Can I think in this way: In Java, there's no ambiguous like this that even got the whole sentences, the parser still cannot choose a production? – Apr 08 '19 at 07:50
2

You could think like this: In Java, there is no ambiguity of syntax, at least to my knowledge. It should always be decidable for the compiler, what kind of language element a token represents. However, there can be an ambiguity of semantics, if the compiler cannot decide on a method to be called, because two methods have ambiguous headers. This can happen with lambda-expressions and the `::`-operator. – TreffnonX Apr 08 '19 at 07:59

Ctx · Answer 3 · 2019-04-08T08:13:20.750

0

Your question cannot be answered easily; this depends on the production rules you have. You say:

there's two production:
<pointer> ::= * {<type-qualifier>}* {<pointer>}?
or
<multiplicative-expression> ::= <multiplicative-expression> * <cast-expression>

But this is not the only possible parser!

With C when looking at

foo * bar;

which could either be a pointer called bar to type foo or the multiplication of foo with bar can be parsed to the token stream:

identifier_or_type ASTERISK identifier_or_type SEMICOLON

and the rest is up to the parser "business logic". So there is no ambiguity at parser level at all here, the logic behind the rule makes the difference between the two cases.

edited Apr 08 '19 at 08:13

answered Apr 08 '19 at 08:00

Ctx

18,090
24
36
51

I don't think so, by talking parsing, I mean to build an AST which all of it node is a certain production for sure. About what you mentioned, the parser still don't know which one to choose. – Apr 08 '19 at 08:04
@reavenisadesk There is only one production here, what should it choose from? – Ctx Apr 08 '19 at 08:04
No, there's two production, ` ::= * {}* {}?` or ` ::= * ` – Apr 08 '19 at 08:07
1

@reavenisadesk My point is exactly that this does not _have_ to be the parser. The rule in the answer above is an unambiguous parsing rule for both cases, the logic behind the rule makes the difference between the two cases. This removes the ambiguity from the parser level. – Ctx Apr 08 '19 at 08:09
No, if you really write a parser, especially a ll(k), you wont just make "id * id" as a not for sure node, because in more generic situation, both a pointer declare or a multiply statement may has non-terminals and need a further parsing. I got you, you just point that "id * id" can be parse, but I don't think any one would think that leave this statement unknown is OK in parsing phase. – Apr 08 '19 at 08:19
@reavenisadesk Under the hood you generate multiple intermediate representations of the source code on its way to the assembler code; I do not think that this is a problem here. There may be other options to migitate this, but a pure lalr(1) parser cannot distinguish between the two semantics you mention. – Ctx Apr 08 '19 at 08:35
This cannot be parsed by using lalr. See https://stackoverflow.com/a/1004737/2269707. GLR is present to fix this reduce conflict. – Apr 08 '19 at 08:43
@reavenisadesk It conflicts, if you have _two_ rules; but not, if you have only one rule as proposed in the answer. A GLR parser is not really appropriate here, since you cannot decide the rule at syntax level even when knowing the whole text. When taking other semantics into account (i.e. deciding if it is a type identifier or a variable) you can decide it immediately, since type definitions or variable declarations must be prior to its usage in C. – Ctx Apr 08 '19 at 08:48

score 0 · Answer 4 · answered Apr 08 '19 at 08:56

0

An expression like foo.bar.bla.i can not be parsed in a meaningfull way using the syntax alone. Each of foo, bar and bla can be either part of the package name, a static variable (this one does not apply to foo), or the name of a inner class.

Example:

public class Main {
    public static void main(String[] args) {
        System.out.println(foo.bar.bla.i);
    }
}

package foo;
public class bar {

    public static class bla {
        public static int i = 42;
    }

//  public static NotBla bla = new NotBla();
    public static class NotBla {
        public static int i = 21;
    }
}

This will print either 21 or 42 when the static variable bla is commented out or not.

answered Apr 08 '19 at 08:56

A.H.

63,967
15
92
126

Nice point, but I think this is a scope priority problem, and no matter with or without the comment, the foo.bar.bla is just scopes in parser's level, right? – Apr 08 '19 at 09:04
@reavenisadesk: I don't understand your point. Scoping (as in "Where does this "reference to `x`" *really* point to?") comes after parsing (i.e. the Abstract Syntax Tree has been built) and is indeed one solution to circumvent the problem. And that's exactly the answer to the question: You cannot parse correctly without additional information (for example from scoping). You cannot declare a Java-Grammar with rules like this: `FullQualifiedClassName := (PackageName '.')? ClassName; PackageName := ID ('.' ID)*;`. – A.H. Apr 08 '19 at 15:59

Does Java has ambiguous syntax which needs more information about an identifier?

4 Answers4