The following code is a valid Java program.
public class Foo
{
    public static void \u006d\u0061\u0069\u006e(String[] args)
    {
        System.out.println("hello, world");
    }
}
The identifier main is written using Unicode escape sequences. The program compiles and runs fine.
$ javac Foo.java && java Foo
hello, world
Although the following details may not be necessary for this question, I am sharing them in case someone is curious. I am using the Java compiler from OpenJDK on Debian 8.0, but what I ask in this question should apply to any Java compiler.
$ javac -version
javac 1.7.0_79
$ readlink -f $(which javac)
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac
The following program fails to compile because the escape sequence used to write the m of main is invalid: \u6d has only two hexadecimal digits where four are required.
public class Foo
{
    public static void \u6d\u0061\u0069\u006e(String[] args)
    {
        System.out.println("hello, world");
    }
}
The compiler complains about an illegal Unicode escape.
$ javac Foo.java && java Foo
Foo.java:3: error: illegal unicode escape
public static void \u6d\u0061\u0069\u006e(String[] args)
^
Foo.java:3: error: invalid method declaration; return type required
public static void \u6d\u0061\u0069\u006e(String[] args)
^
2 errors
What surprised me is that the following program is also invalid, even though the illegal Unicode escape sequence appears only inside a comment.
public class Foo
{
    // This comment contains \u6d.
    public static void main(String[] args)
    {
        System.out.println("hello, world");
    }
}
Here is the error.
$ javac Foo.java && java Foo
Foo.java:3: error: illegal unicode escape
// This comment contains \u6d.
^
1 error
The compiler rejects the illegal Unicode escape sequence even though it sits inside a comment.
The reason behind this behaviour becomes clear when we see how an end-of-line comment is defined in JLS §3.7.
EndOfLineComment:
    / / {InputCharacter}
JLS §3.4 defines InputCharacter as follows.
InputCharacter:
    UnicodeInputCharacter but not CR or LF
Finally, JLS §3.3 defines UnicodeInputCharacter as follows.
UnicodeInputCharacter:
    UnicodeEscape
    RawInputCharacter
UnicodeEscape:
    \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
    u {u}
HexDigit:
    (one of)
    0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
RawInputCharacter:
    any Unicode character
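To make the grammar above concrete, here is a rough sketch of the translation step it describes. It is only an illustration, not how javac is implemented. The names UnicodeEscapeSketch, translate and hexValue are made up for this example. The check on the number of preceding backslashes follows the prose of JLS §3.3 (a backslash can begin an escape only if an even number of backslashes contiguously precede it), which is not part of the productions quoted above.

```java
public class UnicodeEscapeSketch {
    // Returns the value of an ASCII hex digit, or -1 if c is not a HexDigit.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Replaces every Unicode escape in src with the character it denotes.
    // Throws IllegalArgumentException when an escape is malformed.
    static String translate(String src) {
        StringBuilder out = new StringBuilder();
        int precedingBackslashes = 0; // contiguous raw backslashes just before index i
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            boolean startsEscape = c == '\\'
                    && precedingBackslashes % 2 == 0
                    && i + 1 < src.length()
                    && src.charAt(i + 1) == 'u';
            if (!startsEscape) {
                // RawInputCharacter: copied through unchanged.
                out.append(c);
                precedingBackslashes = (c == '\\') ? precedingBackslashes + 1 : 0;
                i++;
                continue;
            }
            int j = i + 1;
            while (j < src.length() && src.charAt(j) == 'u') {
                j++; // UnicodeMarker: one or more 'u'
            }
            int value = 0;
            for (int k = 0; k < 4; k++) {
                // Exactly four HexDigits must follow the marker.
                int digit = (j + k < src.length()) ? hexValue(src.charAt(j + k)) : -1;
                if (digit < 0) {
                    throw new IllegalArgumentException("illegal unicode escape at index " + i);
                }
                value = value * 16 + digit;
            }
            out.append((char) value);
            precedingBackslashes = 0;
            i = j + 4;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "void main()"
        System.out.println(translate("void \\u006d\\u0061\\u0069\\u006e()"));
        try {
            translate("// This comment contains \\u6d.");
        } catch (IllegalArgumentException e) {
            // The translator knows nothing about comments, so it fails here too.
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is that translate has no notion of comments, strings or identifiers. It sees only characters, so a malformed escape is rejected wherever it occurs.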
Therefore, the lexical analyzer must recognize Unicode escape sequences before it can recognize a comment, because a comment is itself made up of input characters that may be Unicode escapes. If it finds an illegal Unicode escape sequence, lexical analysis fails with an error, so the compiler never gets as far as recognizing the comment that contains it.
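A valid escape inside a comment shows the same ordering from the other side. The following is a small sketch, with a made-up class name EscapeInComment. The escape for a line feed is translated before the comment is recognized, so the comment ends at that point and the rest of the line is compiled as code.

```java
public class EscapeInComment {
    static String message() {
        String s = "unchanged";
        // The escape on this line becomes a line feed and ends the comment: \u000a s = "changed";
        return s;
    }

    public static void main(String[] args) {
        System.out.println(message()); // prints "changed"
    }
}
```

If everything after // were simply ignored, this program would print "unchanged".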
I used to think that everything from the start of a comment (say //) to its end is ignored. The above example shows that this is not the case: the lexical analyzer has to recognize Unicode escape sequences between the start and the end of a comment, and an illegal Unicode escape sequence there causes lexical analysis to fail.
What else can cause the compiler to fail while parsing a comment?