
In Java, I learned that the following syntax can be used to write Unicode characters that are not on the keyboard (e.g., non-ASCII characters):

(\u)(u)*(HexDigit)(HexDigit)(HexDigit)(HexDigit)

My question is: What is the purpose of (u)* in the above syntax?

One use case I do understand is representing the yen symbol in Java:

char ch = '\u00A5';
b4hand
user3265048

3 Answers


Interesting question. Section 3.3 of the JLS says:

UnicodeEscape:
    \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

UnicodeMarker:
    u
    UnicodeMarker u

which translates to the regular expression \\u+\p{XDigit}{4}

and

If an eligible \ is followed by u, or more than one u, and the last u is not followed by four hexadecimal digits, then a compile-time error occurs.
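As a quick check of the grammar, here is a minimal sketch that tests candidate strings against that regular expression. The class and method names are made up for illustration, and the pattern only checks the shape of a single escape; it does not implement the "eligible backslash" rule from the JLS.

```java
import java.util.regex.Pattern;

public class UnicodeEscapeRegexDemo {
    // Backslashes are doubled for the string literal and doubled again for
    // the regex, so the compiler never sees this text as a Unicode escape.
    static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u+\\p{XDigit}{4}");

    // Hypothetical helper: true if the whole text is one escape sequence.
    public static boolean isUnicodeEscape(String text) {
        return UNICODE_ESCAPE.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUnicodeEscape("\\u00A5"));     // true: one u
        System.out.println(isUnicodeEscape("\\uuuuu00A5")); // true: several u's
        System.out.println(isUnicodeEscape("\\u00A"));      // false: only three hex digits
        System.out.println(isUnicodeEscape("\\00A5"));      // false: no u at all
    }
}
```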

So you're right, there can be one or more u after the backslash. The reason is given further down:

The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.

This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.

So this input

 \u0020ä

becomes

 \uu0020\u00e4

Here the uu in the first escape means "this was a Unicode escape sequence to begin with", while the single u in the second says "an automatic tool converted a non-ASCII character to a Unicode escape."

This information is useful when you want to convert back from ASCII to Unicode: you can restore as much of the original code as possible.
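The round trip described in the JLS quote can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the implementation of any JDK tool: the class and method names are hypothetical, the input is assumed to contain only well-formed escapes, and a backslash that is itself written as a Unicode escape is not treated specially.

```java
public class UnicodeEscapeRoundTrip {

    // Unicode -> ASCII: add one extra u to every existing escape and turn
    // each non-ASCII character into an escape with a single u.
    public static String toAscii(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\\') {
                int j = i;
                while (j < src.length() && src.charAt(j) == '\\') j++;
                out.append(src, i, j);
                // An odd-length run of backslashes followed by u starts an escape.
                if ((j - i) % 2 == 1 && j < src.length() && src.charAt(j) == 'u') {
                    out.append('u');
                }
                i = j;
            } else if (c > 0x7F) {
                out.append(String.format("\\u%04x", (int) c));
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // ASCII -> Unicode: escapes with several u's lose one u; escapes with a
    // single u become the character they denote.
    public static String fromAscii(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c != '\\') {
                out.append(c);
                i++;
                continue;
            }
            int j = i;
            while (j < src.length() && src.charAt(j) == '\\') j++;
            int k = j;
            while (k < src.length() && src.charAt(k) == 'u') k++;
            int uCount = k - j;
            if ((j - i) % 2 == 1 && uCount >= 1 && k + 4 <= src.length()) {
                out.append(src, i, j - 1); // backslashes before the escape
                String hex = src.substring(k, k + 4);
                if (uCount == 1) {
                    out.append((char) Integer.parseInt(hex, 16));
                } else {
                    out.append('\\');
                    for (int n = 1; n < uCount; n++) out.append('u');
                    out.append(hex);
                }
                i = k + 4;
            } else {
                out.append(src, i, j);
                i = j;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Six-character escape text followed by a real a-umlaut character.
        String original = "\\u0020\u00e4";
        String ascii = toAscii(original);
        System.out.println(ascii);
        System.out.println(fromAscii(ascii).equals(original)); // true
    }
}
```

The first line printed is the ASCII form from the example above, and the second confirms that converting back yields the original text.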

quantum
Aaron Digulla

It means you can add as many u characters as you want. For example, these lines are equivalent:

char ch = '\u00A5';
char ch = '\uuuuu00A5';
char ch = '\uuuuuuuuuuuuuuuuuu00A5';

(and each one compiles)
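To confirm that the extra u characters change nothing, a small sketch like this compares the resulting values (the class and method names are only for illustration):

```java
public class MultipleUDemo {
    // Each literal below is the same escape with a different number of u's.
    public static boolean allEqual() {
        char one = '\u00A5';
        char five = '\uuuuu00A5';
        char many = '\uuuuuuuuuuuuuuuuuu00A5';
        return one == five && five == many && one == 0x00A5;
    }

    public static void main(String[] args) {
        System.out.println(allEqual()); // true
    }
}
```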

assylias
  • I removed my comment. I was mistaken and I also misunderstood his question, sorry. – Xabster Feb 03 '14 at 08:56
  • Is there any idea behind Java allowing extra u's when they are not required? We can see that '\u00A5' is the same as '\uuuuu00A5'. – user3265048 Feb 03 '14 at 08:59
  • @user3265048 Aaron explains why it is allowed - essentially as a unicode marker for compilers. – assylias Feb 03 '14 at 09:01
  • Ah! I did not see Aaron's update last time. I got the answer. But what do you mean by ASCII tools? Eclipse is able to display the yen symbol, so do you mean Eclipse is more than an ASCII tool? – user3265048 Feb 03 '14 at 09:28
  • @user3265048 Yes, Eclipse works just fine with proper Unicode characters in the source code and everything. But there are some simple command line tools that can't deal with non-ASCII shit. – Paul Stelian Sep 06 '21 at 10:52

Java supports only the \uXXXX notation (4 hex digits), which covers Unicode characters in the BMP; it doesn't support a \u{YYYYY} notation (5 hex digits) for characters outside the BMP (the 16 other planes). So it's impossible to store such a character in a single char constant; you have to write it as a surrogate pair.

For example, if you want to write MATHEMATICAL BOLD CAPITAL A (U+1D400), you can't write "\u{1D400}": it's an illegal Unicode escape sequence in Java. Writing "\u1D400" is parsed as "\u1D40" + "0", so it outputs ᵀ0. You really have to use surrogates in Java, so you have to write "\uD835\uDC00" instead.

But writing surrogates by hand is inconvenient, so if you want to build the string directly from a code point, you can use one of these tricks:

String test1 = new String(new int[] { 0x1D400 }, 0, 1);
String test2 = String.valueOf(Character.toChars(0x1D400));
String test3 = Character.toString(0x1D400); // the int overload requires Java 11+
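To see that the surrogate pair and the code point really denote the same string, here is a minimal sketch (the class and helper names are hypothetical) that compares the two forms and inspects the result:

```java
public class SurrogatePairDemo {
    // Builds a string from a single code point, using surrogates if needed.
    public static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String viaEscapes = "\uD835\uDC00";
        String viaCodePoint = fromCodePoint(0x1D400);

        System.out.println(viaEscapes.equals(viaCodePoint));                    // true
        System.out.println(viaEscapes.length());                                // 2 (char units)
        System.out.println(viaEscapes.codePointCount(0, viaEscapes.length()));  // 1 (code point)
        System.out.printf("U+%X%n", viaEscapes.codePointAt(0));                 // U+1D400
    }
}
```

Note that length() counts UTF-16 char units, which is why the string has length 2 even though it holds a single character.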
noraj