
The following code is a valid Java program.

public class Foo
{
    public static void \u006d\u0061\u0069\u006e(String[] args)
    {
        System.out.println("hello, world");
    }
}

The main identifier is written using Unicode escape sequences. It compiles and runs fine.

$ javac Foo.java && java Foo
hello, world

Although the following details may not be necessary for this question, I am sharing them in case someone is curious. I am using the Java compiler from OpenJDK on Debian 8.0, but what I ask in this question should apply to any Java compiler.

$ javac -version
javac 1.7.0_79
$ readlink -f $(which javac)
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac

The following program fails to compile because the escape sequence used to write the m of main is invalid: it has only two hexadecimal digits instead of the required four.

public class Foo
{
    public static void \u6d\u0061\u0069\u006e(String[] args)
    {
        System.out.println("hello, world");
    }
}

The compiler complains about an illegal Unicode escape.

$ javac Foo.java && java Foo
Foo.java:3: error: illegal unicode escape
    public static void \u6d\u0061\u0069\u006e(String[] args)
                           ^
Foo.java:3: error: invalid method declaration; return type required
    public static void \u6d\u0061\u0069\u006e(String[] args)
                            ^
2 errors

What surprised me is that the following program is also invalid, even though the illegal Unicode escape sequence appears to be in a comment.

public class Foo
{
    // This comment contains \u6d.
    public static void main(String[] args)
    {
        System.out.println("hello, world");
    }
}

Here is the error.

$ javac Foo.java && java Foo
Foo.java:3: error: illegal unicode escape
    // This comment contains \u6d.
                                 ^
1 error

The compiler complains about the illegal unicode escape sequence although it appears to be in a comment.

The reason behind this behaviour becomes clear when we see how an end-of-line comment is defined in JLS §3.7.

EndOfLineComment:
/ / {InputCharacter} 

JLS §3.4 defines InputCharacter as follows.

InputCharacter:
  UnicodeInputCharacter but not CR or LF 

Finally, JLS §3.3 defines UnicodeInputCharacter as follows.

UnicodeInputCharacter:
  UnicodeEscape
  RawInputCharacter

UnicodeEscape:
  \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

UnicodeMarker:
  u {u}

HexDigit:
  (one of)
  0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

RawInputCharacter:
  any Unicode character

Therefore, the lexical analyzer must recognize Unicode escape sequences before it can recognize comments. If it finds an illegal Unicode escape sequence, lexical analysis fails with an error, so the compiler never gets as far as recognizing the comment that contains it.
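
To make the grammar above concrete, here is a minimal sketch of the escape-translation step written as ordinary Java. It is not how javac is implemented; the class name UnicodeEscapeSketch and the methods hexValue and translate are hypothetical names chosen for illustration. It follows the UnicodeEscape, UnicodeMarker and HexDigit rules quoted above, plus the JLS §3.3 rule that a backslash preceded by an odd number of contiguous backslashes does not start an escape.

```java
public class UnicodeEscapeSketch {

    // Value of an ASCII hex digit, or -1 if c is not one (the HexDigit rule).
    public static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Translates every Unicode escape in raw and throws if one is malformed.
    // Nothing here knows about comments: the whole input is scanned the same way.
    public static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0; // contiguous backslashes immediately before index i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            boolean startsEscape = c == '\\'
                    && backslashes % 2 == 0
                    && i + 1 < raw.length()
                    && raw.charAt(i + 1) == 'u';
            if (!startsEscape) {
                backslashes = (c == '\\') ? backslashes + 1 : 0;
                out.append(c);
                i++;
                continue;
            }
            int j = i + 1;
            while (j < raw.length() && raw.charAt(j) == 'u') {
                j++; // UnicodeMarker: u {u}
            }
            int value = 0;
            for (int k = j; k < j + 4; k++) {
                // Exactly four hex digits are required after the marker.
                int digit = k < raw.length() ? hexValue(raw.charAt(k)) : -1;
                if (digit < 0) {
                    throw new IllegalArgumentException("illegal unicode escape at index " + i);
                }
                value = value * 16 + digit;
            }
            out.append((char) value);
            backslashes = 0;
            i = j + 4;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escaped spelling of main from the first program.
        System.out.println(translate("\\u006d\\u0061\\u0069\\u006e"));
        try {
            // The comment line from the third program.
            translate("// This comment contains \\u6d.");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because translate never looks for // or /*, the malformed escape is rejected in exactly the same way whether it sits in code or in what would later have become a comment.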

I used to think that everything from the start of a comment (say //) to its end is ignored. The above example shows that this is not the case: the lexical analyzer has to recognize Unicode escape sequences between the start and the end of a comment, and an illegal one causes lexical analysis to fail.

What else can cause the compiler to fail while parsing a comment?

Susam Pal
  • 1
    look [here](http://stackoverflow.com/questions/9225124/error-due-to-content-in-a-legal-comment-in-java) – Dando18 Sep 07 '15 at 16:25
  • @Dando18 Thanks for sharing the link. However, none of the answers there really answers this question. The answer that talks about `@deprecated` is not reproducible in OpenJDK. The answer that mentions `/* Compiler Error due to this Unicode char '*/' */` is incorrect because the trailing `*/` is clearly not within the comment. The other two answers don't address the specific question that was asked. – Susam Pal Sep 07 '15 at 16:45
  • 1
    http://stackoverflow.com/q/30727515/2158288 – ZhongYu Sep 07 '15 at 16:51
  • Actually, the `@deprecated` answer is reproducible using `javac -Xlint`, provided the `@deprecated` is in a *javadoc* comment (`/** @deprecated */`). – RealSkeptic Sep 07 '15 at 17:22
  • @RealSkeptic Could you please share the code and the error you get for using `@deprecated` in a javadoc comment and then compiling it with `javac -Xlint`? I am unable to reproduce any error with http://ideone.com/pcgSWq. I compiled it with `javac -Xlint` as well as `javac -Xlint:all` but no error occurred. – Susam Pal Sep 07 '15 at 17:39
  • Not an error, a warning. You can get one warning there if you remove the `@Deprecated` **annotation**, as `Xlint` warns when that happens, so you'll have a warning resulting solely from a comment. – RealSkeptic Sep 07 '15 at 17:51

1 Answer


Short:

Nothing (nothing else).

Long:

Logically, the \u escape sequences are handled before lexical processing (scanning/tokenizing) takes place. According to https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.2:

A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:

  1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.

  2. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (§3.4).

  3. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements (§3.5) which, after white space (§3.6) and comments (§3.7) are discarded, comprise the tokens (§3.5) that are the terminal symbols of the syntactic grammar (§2.3).

So technically, \u6d in your example is NOT yet part of a comment when the error occurs. Whether it belongs to the comment is decided only after it has been translated to the corresponding Unicode character in step 1, and that translation fails because \u6d is malformed.

As proof, the following class should compile, because \u000a is translated to a line feed that ends the comment before the method declaration:

public class Test {
    // is comment, the rest, not\u000a public static void main( String[] args) {
        System.out.println("See!");
    }
}
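
The effect can also be sketched without a compiler. The following is a rough model of the three translation steps, not javac's actual implementation: the class name CommentBoundarySketch and both method names are hypothetical, step 1 is simplified to a regular expression that silently leaves malformed escapes alone, and the comment stripper ignores string literals and /* */ comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentBoundarySketch {

    // Step 1 (simplified): replace each well-formed escape with its character.
    public static String translateEscapes(String raw) {
        Matcher m = Pattern.compile("\\\\u+([0-9a-fA-F]{4})").matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Steps 2 and 3 (simplified): split at line terminators, then drop
    // everything from the first double slash to the end of each line.
    public static String stripEndOfLineComments(String translated) {
        StringBuilder out = new StringBuilder();
        for (String line : translated.split("\r\n|\r|\n", -1)) {
            int start = line.indexOf("//");
            out.append(start >= 0 ? line.substring(0, start) : line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The second line of the Test class above, with the escape kept literal.
        String raw = "// is comment, the rest, not\\u000a public static void main( String[] args) {";

        // Without step 1 the whole line looks like a comment and is discarded.
        System.out.print(stripEndOfLineComments(raw));

        // With step 1 applied first, the escape becomes a line feed, the comment
        // ends there, and the method declaration survives.
        System.out.print(stripEndOfLineComments(translateEscapes(raw)));
    }
}
```

The order of the calls is the whole point: comment boundaries are found only in the text that step 1 produces, so an escape can move or end a comment, and a malformed escape stops processing before any comment is recognized.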
Community
  • 1
    I think you should be emphasizing why that part of the JLS means that *nothing else* is going to cause an error in a comment, and less about the reason for the error, which the OP seems to already understand. – RealSkeptic Sep 07 '15 at 16:46