How does Java perform lexical translation?

Question

In the Java spec, I read that

A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.

Does this mean that lexical translation applies only to ASCII characters? When I wrote code containing Cyrillic, Hebrew, or Kanji characters, there was no compile-time error, even though these characters are not ASCII.

I don't understand why. Can anyone help me understand?

You can put Unicode characters into comments, string literals, and char literals, yeah, but I'm pretty sure variable names and all need to be ASCII, I'm pretty sure. (You can also escape (\udddd) but the actual characters there are in ASCII). — user, Apr 21 '20 at 15:04
@user: but "This translation step allows any program to be expressed using only ASCII characters", what does it mean? — locobe, Apr 21 '20 at 15:10
Basically if you want to represent, say, a line break in a comment, you can write \u000d, which is what a line break is in hex (I think). While the individual characters '\', 'u', '0', and 'd' are encoded as ASCII chars, when Java's compiler goes through them, they get turned into Unicode characters internally — user, Apr 21 '20 at 15:12
@user: String \u3058 = "" is fine although \u3058 is not ASCII — locobe, Apr 21 '20 at 15:13
The individual characters in "\u3058" are ASCII. Related: https://stackoverflow.com/questions/30727515/why-is-executing-java-code-in-comments-with-certain-unicode-characters-allowed — user, Apr 21 '20 at 15:14
Does this answer your question? [How does java handle unicode characters?](https://stackoverflow.com/questions/7482914/how-does-java-handle-unicode-characters) — Savior, Apr 21 '20 at 15:23

score 3 · Accepted Answer · edited Jun 20 '20 at 09:12

The quote doesn't say anything about what happens if you write a program containing a Cyrillic/Hebrew letter. In fact, the section just before the one you quoted says:

3.1 Unicode

Programs are written using the Unicode character set.

Note that "allows" here means that this translation step adds a new capability to Java. When you are allowed to do something, you can, but are not required to do it.

The quote merely says that the lexical translator will turn anything of the form \uxxxx to the corresponding Unicode character U+xxxx.

The natural consequence of this is that you can write a program containing any Unicode code point (i.e. "any program") using only an ASCII keyboard. How? Whenever you need to write some non-ASCII character, just write its Unicode escape.
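The "just write its Unicode escape" step can also be done mechanically. The following is only a sketch: `AsciiEscaper` and `toAsciiSource` are hypothetical names made up for this illustration, not part of the JDK or the spec. It replaces every non-ASCII UTF-16 code unit in a piece of text with the corresponding escape, so the result contains only ASCII characters.

```java
public class AsciiEscaper {
    // Hypothetical helper: rewrites each non-ASCII UTF-16 code unit
    // as a backslash, the letter u, and four hex digits.
    static String toAsciiSource(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char unit = text.charAt(i);
            if (unit < 0x80) {
                out.append(unit); // already ASCII, keep as-is
            } else {
                out.append(String.format("\\u%04x", (int) unit));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input contains one Cyrillic letter, written here as an escape
        // so that this file itself stays pure ASCII.
        System.out.println(toAsciiSource("int \u0414 = 0;"));
    }
}
```

Running `main` prints an ASCII-only line in which the Cyrillic letter has been replaced by its six-character escape. Working per code unit rather than per code point matches the wording of the quote, which defines an escape as a single UTF-16 code unit.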

As a concrete example:

These are valid Java statements:

    int Д = 0;
    System.out.println("Д");

But let's say my text editor can only handle ASCII text, or that I only have a US keyboard, so I can't type "Д". The language spec says that I can still write this in ASCII, like this:

    int \u0414 = 0;
    System.out.println("\u0414");

It will do exactly the same thing.
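To check this yourself, here is a small self-contained sketch. The class and method names are made up for the illustration, and the file is deliberately pure ASCII so it compiles regardless of the source encoding. It declares a variable through its escape, reads it back, and inspects the string literal.

```java
public class UnicodeEscapeDemo {
    // The identifier is written as an escape in both places;
    // after lexical translation it is a one-letter Cyrillic name.
    static int readEscapedIdentifier() {
        int \u0414 = 42;
        return \u0414;
    }

    // The literal contains one character, not six.
    static String escapedLiteral() {
        return "\u0414";
    }

    public static void main(String[] args) {
        System.out.println(readEscapedIdentifier());                // 42
        System.out.println(escapedLiteral().length());              // 1
        System.out.println(escapedLiteral().charAt(0) == 0x0414);   // true
    }
}
```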

edited Jun 20 '20 at 09:12 by Community · answered Apr 21 '20 at 15:13 by Sweeper

A classic (...) demonstration of this is `System.out.println("Hello \u0022 + \u0022 world")`, which prints `Hello world` (should have 2 spaces, SO is rendering as 1), rather than `Hello " + " world`. – Andy Turner Apr 21 '20 at 15:35
@Sweeper: You mean that the lexical translation can transform any Escape characters (ASCII as well as Non-ASCII). Especially, for programmers using their native languages (Hebrew,...) this process is particularly important because they can write any non-ascii characters. Right? – locobe Apr 21 '20 at 16:25
Regarding your first sentence, yes, escape sequences can denote both ASCII and non-ASCII characters, since ASCII is a subset of Unicode. I think this fact is more important for people who are trying to _read code_. e.g. In my code, I might use some exotic Unicode characters in my code. Your computer might not have a font that can display it, causing them all to look like "�" . If I write my code with escape sequences, you can at least see the different code points. @locobe – Sweeper Apr 21 '20 at 16:36
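The demonstration from Andy Turner's comment can be run as the sketch below (class and method names are hypothetical). It shows that escapes are translated before the compiler splits the source into tokens, so an escaped double quote really does end a string literal. The second method relies on a rule from §3.3 that is not discussed in this thread: a backslash preceded by another backslash does not start a Unicode escape.

```java
public class EscapeBeforeTokens {
    // After translation the compiler sees two literals joined by +,
    // i.e. "Hello " + " world".
    static String concatenated() {
        return "Hello \u0022 + \u0022 world";
    }

    // A doubled backslash is an ordinary string escape for one backslash,
    // so this literal stays six characters long and is not translated.
    static String notTranslated() {
        return "\\u0414";
    }

    public static void main(String[] args) {
        System.out.println(concatenated().length());   // 12
        System.out.println(notTranslated().length());  // 6
    }
}
```

The length of 12 confirms the two spaces in the middle: six characters from the first literal and six from the second.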
