Why is the Java compiler stripping all Unicode characters before the actual compilation?

Question

I am very new to Java and I have code like this:

    public class Puzzle {
        public static void main(String... args) {
            System.out.println("Hi Guys!");
    //        Character myChar = new Character('\u000d');
        }
    }

You can see the line:

Character myChar = new Character('\u000d');

is commented out. But still, I get an error like this when I run javac:

Puzzle.java:9: error: unclosed character literal
//        Character myChar = new Character('\u000d');
                                                  ^
1 error

In this blog post I found the reason for the error. The blog says:

Java compiler, just before the actual compilation strips out all the unicode characters and converts it to character form. This parsing is done for the complete source code which includes the comments also. After this conversion happens then the Java compilation process continues.

In our code, when the Java compiler encounters \u000d, it considers this as a newline and changes the code as below,

public class Puzzle {
    public static void main(String... args) {
        System.out.println("Hi Guys!");
//      Character myChar = new Character('
        ');
   }
}

With this I have two questions:

1. Why does Java parse the Unicode escapes first? Are there any advantages to it?
2. Even though the line is commented out, Java still tries to parse it. Is this the only case where it does so, or does it generally parse commented lines too? I'm confused.

Thanks in advance.

It's not clear what you're asking with Question 2. It's clear that the final code snippet will not compile. — Oliver Charlesworth, Dec 06 '14 at 14:13
You've already answered the second question :) Java doesn't parse comments, but after preprocessing, javac sees two lines, the 1st one is a comment, and the 2nd one is `')`, which is a syntax error. — Alex Shesterov, Dec 06 '14 at 14:14
And the answer to Question 1 is: when else would it do the Unicode translation? — Oliver Charlesworth, Dec 06 '14 at 14:14
@OliverCharlesworth Personally, I would expect the lexer to skip all the tokens in the comment. It doesn't seem logical to me to do this processing before the actual compilation. What would be the benefit? — Alexis C., Dec 06 '14 at 14:25
@ZouZou: For consistency. Unicode translation is simply the very first thing the compiler does (see http://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.2). — Oliver Charlesworth, Dec 06 '14 at 14:27

icza · Accepted Answer · 2014-12-06T14:48:22.533

Why does Java parse the Unicode escapes first? Are there any advantages to it?

Yes, Unicode escapes are replaced first, before the compiler proceeds to lexical analysis.

Quoting from The Java™ Language Specification, §3.3 Unicode Escapes:

A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) for the indicated hexadecimal value, and passing all other characters unchanged.

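So, for example, the following source code results in a compile error, because the translated carriage return ends the // comment and leaves "; behind as code: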

// String s = "\u000d";

But this one is valid:

/*String s = "\u000d";*/

Because once \u000d is replaced with a carriage return, which Java treats as a line terminator, the code looks like this:

/*String s = "
";*/

This is fine, because a /* */ comment may span multiple lines.
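
To make the translation step concrete, here is a minimal sketch of it in plain Java. It is a simplified model of the rule quoted from JLS §3.3, not javac's actual implementation. The class name UnicodeEscapeSketch and the methods translate and hexValue are hypothetical names chosen for illustration. Unlike javac, the sketch passes malformed escapes through unchanged instead of reporting an error.

```java
// Simplified model of the first translation step described in JLS 3.3.
// Comments in this file say "backslash-u" on purpose: a real escape written
// inside a comment would itself be translated when this file is compiled.
final class UnicodeEscapeSketch {

    /** Replaces every eligible backslash-u escape with the code unit it names. */
    static String translate(String source) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0; // contiguous backslashes directly before index i
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            // A backslash can start an escape only after an even number of backslashes.
            if (c == '\\' && backslashRun % 2 == 0) {
                int j = i + 1;
                while (j < source.length() && source.charAt(j) == 'u') {
                    j++; // one or more 'u' characters are allowed
                }
                int value = (j > i + 1) ? hexValue(source, j) : -1;
                if (value >= 0) {
                    out.append((char) value);
                    i = j + 4;
                    backslashRun = 0;
                    continue;
                }
            }
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    /** Value of the four hex digits at start, or -1 if they are not all hex digits. */
    private static int hexValue(String s, int start) {
        if (start + 4 > s.length()) {
            return -1;
        }
        int value = 0;
        for (int k = start; k < start + 4; k++) {
            int digit = Character.digit(s.charAt(k), 16);
            if (digit < 0) {
                return -1;
            }
            value = value * 16 + digit;
        }
        return value;
    }

    public static void main(String[] args) {
        String lineComment = translate("// String s = \"\\u000d\";");
        String blockComment = translate("/*String s = \"\\u000d\";*/");
        // Show the carriage return visibly instead of letting the terminal act on it.
        System.out.println(lineComment.replace("\r", "<CR>"));
        System.out.println(blockComment.replace("\r", "<CR>"));
    }
}
```

Running main shows the carriage return landing inside both comments. Only the // form is cut short by it, because a line comment ends at the first line terminator.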

Also the following code:

public static void main(String[] args) {
    // Comment.\u000d System.out.println("I will be printed out");
    // Comment.\u000a System.out.println("Me too.");
}

Will print out:

I will be printed out
Me too.

Because after the Unicode escapes are translated, both System.out.println() statements end up outside the comments.

To answer your question: the Unicode translation has to happen at some point. One could argue for doing it either before or after comments are removed. The choice was made to do it before.

The reasoning might be that a comment is just another lexical element, and you usually want to replace Unicode escapes before identifying and analyzing lexical elements.

See this example:

/\u002f This is a comment line

If placed in a Java source file, it causes no compile errors, because \u002f is translated to the character '/' and, together with the preceding '/', forms the start of a line comment //.
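
Both effects can be checked with a real compiler. The sketch below is a hypothetical class (EscapedCommentDemo and greeting are illustrative names, not from the question) whose source relies on the translation happening before comments are recognized. The first line of the method body only becomes a comment after \u002f turns into '/', and the \u000a in the second comment ends that comment, so the assignment after it is compiled as code.

```java
// Assumes compilation with a standard Java compiler such as javac.
final class EscapedCommentDemo {

    static String greeting() {
        /\u002f This line is a comment only after the escape becomes a slash.
        String text = "visible";
        // The escape ends this comment early.\u000a text = text + " and reached";
        return text;
    }

    public static void main(String[] args) {
        System.out.println(greeting());
    }
}
```

Running main prints visible and reached, which shows that the assignment that looks commented out was executed.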

Even though the line is commented out, Java still tries to parse it. Is this the only case where it does so, or does it generally parse commented lines too? I'm confused.

The Java compiler does not analyze the contents of comments, but it still has to scan them to know where they end.
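
Finding where a comment ends is a purely lexical question, which is why a translated carriage return matters for // but not for /* */. A rough sketch of that scan, applied to text whose escapes have already been translated, might look like this. CommentEndSketch and commentEnd are hypothetical names, and treating an unterminated /* comment as running to the end of the input is a simplification (javac reports an error instead).

```java
final class CommentEndSketch {

    /**
     * Returns the index just past the comment that starts at {@code start},
     * or -1 if no comment starts there. The input is assumed to be text in
     * which escapes have already been translated.
     */
    static int commentEnd(String text, int start) {
        if (text.startsWith("//", start)) {
            int i = start + 2;
            // A line comment stops at the first CR or LF, wherever it came from.
            while (i < text.length() && text.charAt(i) != '\r' && text.charAt(i) != '\n') {
                i++;
            }
            return i;
        }
        if (text.startsWith("/*", start)) {
            // A block comment stops only at the closing delimiter.
            int close = text.indexOf("*/", start + 2);
            return close < 0 ? text.length() : close + 2; // simplification, see lead-in
        }
        return -1;
    }

    public static void main(String[] args) {
        String line = "// String s = \"\r\";";
        String block = "/*String s = \"\r\";*/";
        // Whatever follows the comment is handed to the rest of the compiler as code.
        System.out.println("after line comment: " + line.substring(commentEnd(line, 0)).trim());
        System.out.println("after block comment: " + block.substring(commentEnd(block, 0)).trim());
    }
}
```

For the line comment, the leftover text is the stray quote and semicolon that javac complains about. For the block comment, nothing is left over.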
