Java Unicode translation

Question

I came across the following code:

public class LinePrinter {
    public static void main(String args[]) {
      //Note: \u000A is unicode for Line Feed
      char c=0x000A;
      System.out.println(c);
    }
}

This doesn't compile because of the Unicode escape replacement performed by the compiler.

The question is, why doesn't the comment (//) override Unicode replacement done by the compiler? I thought the compiler should ignore the comments first before doing anything else with the code translation.

EDIT:

Not sure if the above is clear enough.

I know what happens with the above and why it errors out. My expectation is that the compiler should ignore all the commented lines before doing any translation with the code. Obviously that's not the case here. I am expecting a rationale for this behaviour.

possible duplicate of [Why is executing Java code in comments with certain Unicode characters allowed?](http://stackoverflow.com/questions/30727515/why-is-executing-java-code-in-comments-with-certain-unicode-characters-allowed) — phuclv, Jun 29 '15 at 05:09

assylias · Answer 1 · 2012-12-07T11:24:35.850

5

This is puzzle 14 in Java Puzzlers. An extract of the explanation:

The key to understanding this puzzle is that Java provides no special treatment for Unicode escapes within string literals. The compiler translates Unicode escapes into the characters they represent before it parses the program into tokens, such as string literals [JLS 3.2].

The relevant section of the JLS (v7) is §3.3:

A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged.

The introduction to section 3 of the JLS gives a hint as to why this is the case:

Programs are written in Unicode (§3.1), but lexical translations are provided (§3.2) so that Unicode escapes (§3.3) can be used to include any Unicode character using only ASCII characters.
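
As a rough illustration of the translation step described in §3.3, here is a minimal sketch of what happens to the comment line from the question. The class `EscapeTranslationSketch` and its `translate` method are made up for this example and are not how javac is implemented. They only model the rules of that section: `\u` plus four hex digits becomes one UTF-16 code unit, more than one `u` is allowed, and a backslash that directly follows another backslash does not start an escape.

```java
public class EscapeTranslationSketch {

    // Rough model of the first lexical translation step (JLS 3.3):
    // a backslash, one or more 'u' characters and four hex digits
    // become the single UTF-16 code unit they name.
    static String translate(String source) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < source.length()) {
            char ch = source.charAt(i);
            boolean hasNext = i + 1 < source.length();
            if (ch == '\\' && hasNext && source.charAt(i + 1) == '\\') {
                // A pair of backslashes is copied as-is; the second one
                // cannot start an escape.
                out.append("\\\\");
                i += 2;
            } else if (ch == '\\' && hasNext && source.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < source.length() && source.charAt(j) == 'u') {
                    j++; // the JLS permits more than one 'u'
                }
                if (j + 4 > source.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                int code = 0;
                for (int k = j; k < j + 4; k++) {
                    int digit = "0123456789ABCDEF"
                            .indexOf(Character.toUpperCase(source.charAt(k)));
                    if (digit < 0) {
                        throw new IllegalArgumentException("bad hex digit at index " + k);
                    }
                    code = code * 16 + digit;
                }
                out.append((char) code);
                i = j + 4;
            } else {
                out.append(ch);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The comment line from the question. The backslash is doubled so
        // that this string literal holds the raw six-character escape.
        String comment = "//Note: \\u000A is unicode for Line Feed";
        String[] lines = translate(comment).split("\n");
        System.out.println(lines.length + " lines after translation");
        System.out.println("line 2 is no longer a comment:" + lines[1]);
    }
}
```

Running `main` shows the single comment line turning into two lines. The second one (` is unicode for Line Feed`) is no longer inside a comment, which is why the compiler rejects the original program.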

edited Dec 07 '12 at 11:24

answered Dec 07 '12 at 11:09

assylias


This explains why the compiler gives an error. But my question is: why does the compiler parse Unicode escapes *before* leaving out the comments in the code? – user1885220 Dec 07 '12 at 11:11
2

@user1885220 Because it is part of the specification of the language. If your question is "why was the language specified that way?", I have no idea. – assylias Dec 07 '12 at 11:12
3

@user1885220 Unicode escapes have to be processed before anything else to allow for things like `native2ascii`-ing code that uses non-ASCII characters in identifiers: `int é = 5;` -> `int \u00e9 = 5;` – Ian Roberts Dec 07 '12 at 11:17
@user1885220 re. your update: the rationale for the behaviour is that your compiler is compliant with the specification of the language! – assylias Dec 07 '12 at 11:21
@assylias That can be said about every compiler error: "Your code isn't compliant with the language spec". I think Ian Roberts answered what I asked for. – user1885220 Dec 07 '12 at 11:27
@IanRoberts If you can post your comment as an answer, I'll accept it. – user1885220 Dec 07 '12 at 11:35

score 2 · Accepted Answer · answered Dec 07 '12 at 11:35

The specification states that a Java compiler must convert Unicode escapes to their corresponding characters before doing anything else, to allow for things like non-ASCII characters in identifiers to be protected (via native2ascii) when the code is stored or sent over a channel that is not 8-bit clean.

This rule applies globally. In particular, you can even escape comment markers using Unicode escapes. For example, the following two snippets are identical:

// Deal with opening and closing comment characters /*, etc.
myRisquéParser.handle("/*", "*/");

\u002F\u002F Deal with opening and closing comment characters /*, etc.
myRisqu\u00E9Parser.handle("/*", "*/");

If the compiler were to remove comments before handling Unicode escapes, it would end up stripping everything from the `/*, etc.` through the `handle("/*", "*/`, leaving

\u002F\u002F Deal with opening and closing comment characters ");

which would then be unescaped to a single-line comment and removed at the next stage of parsing. The result would be no compiler error or warning, just a whole line of code silently dropped.
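
To see the "identical snippets" claim in a form you can compile, here is a small sketch along the same lines. The class name `EscapedMarkers` and the method `describe` are hypothetical, chosen only for this example. The source is pure ASCII, yet after translation the first line of the method body is an ordinary end-of-line comment and the variable is named `café`.

```java
public class EscapedMarkers {

    static String describe() {
        \u002F\u002F Once translated, this line is a normal comment /*, etc.
        String caf\u00E9 = "/* not a comment */";
        return caf\u00E9;
    }

    public static void main(String[] args) {
        // The comment markers inside the string literal are left alone.
        System.out.println(describe());
    }
}
```

This only works because escapes are resolved before comments and tokens are recognized. With the order reversed, the `/*` on the escaped comment line would be taken as the start of a block comment, as described above.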

+1, I think you nailed *why it is so*. The other answer merely restates the language spec, which was already covered in the question/comments. Your examples are spot on! — user1885220, Dec 07 '12 at 11:39
C# solves this very nicely: Unicode escapes in comments are ignored. If you can write `\` then there's no need for escaping it just because many compilers can't deal with Unicode characters, as it's already in the ASCII range. That prevents a lot of unexpected things like this http://stackoverflow.com/q/27332545/995714 or this http://stackoverflow.com/q/3866187/995714 or http://stackoverflow.com/q/30727515/995714 — phuclv, Jun 29 '15 at 05:08
