Java charAt used with characters that have two code units

Question

From Core Java, vol. 1, 9th ed., p. 69:

The character ℤ requires two code units in the UTF-16 encoding. Calling
String sentence = "ℤ is the set of integers"; // for clarity; not in book
char ch = sentence.charAt(1)
doesn't return a space but the second code unit of ℤ.

But it seems that sentence.charAt(1) does return a space. For example, the if statement in the following code evaluates to true.

String sentence = "ℤ is the set of integers";
if (sentence.charAt(1) == ' ')
    System.out.println("sentence.charAt(1) returns a space");

Why?

I'm using JDK SE 1.7.0_09 on Ubuntu 12.10, if it's relevant.

It doesn't currently contain anything about the above (except to say the section numbering is wrong), but for reference, here's the [errata page](http://www.horstmann.com/corejava/bugs.html). — Greg Kopff, Jan 04 '13 at 03:16
Does the book say what code point this grapheme represents? There is scope for [ambiguity](http://www.unicode.org/charts/PDF/U1D400.pdf) as many code points look similar. — McDowell, Jan 04 '13 at 10:57
A more direct question that does not have the book bug: http://stackoverflow.com/questions/1527856/how-can-i-iterate-through-the-unicode-codepoints-of-a-java-string :-) — Ciro Santilli OurBigBook.com, May 07 '15 at 09:34

score 10 · Accepted Answer · answered Jan 04 '13 at 04:46

It sounds like the book is saying that 'ℤ' is not a UTF-16 character in the basic multilingual plane, but in fact it is.

Java uses UTF-16 with surrogate pairs for characters that are not in the basic multilingual plane. Since 'ℤ' (0x2124) is in the basic multilingual plane it is represented by a single code unit. In your example sentence.charAt(0) will return 'ℤ', and sentence.charAt(1) will return ' '.

A character represented by a surrogate pair is made up of two code units. If the string began with such a character, sentence.charAt(0) would return the first code unit, and sentence.charAt(1) would return the second code unit.

See http://docs.oracle.com/javase/6/docs/api/java/lang/String.html:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
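To make the difference concrete, here is a minimal sketch comparing the character from the question (U+2124, in the BMP) with U+1D419 MATHEMATICAL BOLD CAPITAL Z, a supplementary character. The class name `CharAtDemo` and the helper `spaceAtIndexOne` are made up for illustration; the supplementary string is built with `Character.toChars` so that no surrogate values have to be typed by hand.

```java
public class CharAtDemo {
    // Returns true when the char (UTF-16 code unit) at index 1 is a space.
    public static boolean spaceAtIndexOne(String s) {
        return s.charAt(1) == ' ';
    }

    public static void main(String[] args) {
        // U+2124 is in the BMP, so it occupies a single code unit.
        String bmp = "\u2124 is the set of integers";
        // U+1D419 is a supplementary character, so it occupies two code units.
        String supplementary =
                new String(Character.toChars(0x1D419)) + " is the set of integers";

        System.out.println(spaceAtIndexOne(bmp));            // true
        System.out.println(spaceAtIndexOne(supplementary));  // false
        // Index 1 holds the low (second) surrogate of the pair.
        System.out.println(Character.isLowSurrogate(supplementary.charAt(1))); // true
        // The space only shows up at index 2.
        System.out.println(supplementary.charAt(2) == ' ');  // true
    }
}
```

With the supplementary character, the behaviour matches what the book describes: index 1 is the second code unit, not the space.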

score 8 · Answer 2 · edited May 23 '17 at 10:29

According to the documentation, String is represented internally as UTF-16, so charAt() indexes UTF-16 code units rather than code points. If you are interested in seeing the individual code points you can use this code (from this answer):

final int length = sentence.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = sentence.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}
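As a runnable variant of the loop above, the following sketch collects the code points into a list. The class name `CodePointWalk`, the method `codePointsOf`, and the sample string are assumptions made for this example, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointWalk {
    // Collects the code points of s using the same loop shape as above.
    public static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            result.add(codepoint);
            // Advance by 1 for a BMP character, by 2 for a surrogate pair.
            offset += Character.charCount(codepoint);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample: U+2124, then supplementary U+1D419, then 'a'.
        String sample = "\u2124" + new String(Character.toChars(0x1D419)) + "a";
        for (int cp : codePointsOf(sample)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
        // prints U+2124, U+1D419 and U+61, one per line
    }
}
```

The sample string has four code units but only three code points, which is why the loop advances by `Character.charCount(codepoint)` instead of by one.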

answered Jan 04 '13 at 03:12

Jason Sperske

Where you would run into 'trouble' would be a Unicode supplementary character that took 4 bytes to represent, rather than 2 (the size of Java's `char`). – Greg Kopff Jan 04 '13 at 03:20
"Two bytes should be enough for everyone" - Bill Gåtes – Jason Sperske Jan 04 '13 at 03:22
@GregKopff - Yeah, but that gets ugly no matter what as `char` stops working as well. – Brian Roach Jan 04 '13 at 03:26
`charAt()` gives one code unit, which may be either a code point (for BMP characters) or a surrogate code unit, which might be informally characterized as “half a code point”. Never two code points. – Jukka K. Korpela Jan 04 '13 at 09:11
You can use `offset = sentence.offsetByCodePoints(offset, 1);` instead of using `offset += Character.charCount(codepoint);` – Remy Lebeau Jan 16 '18 at 23:38

score 3 · Answer 3 · answered Jan 26 '19 at 21:40

Horstmann was talking about a 'Z' character that needs two UTF-16 code units. Take a look at this code:

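public class Main {
    public static void main(String[] args)
    {
        // "\uD83D\uDE02" is the surrogate pair for a single emoji (U+1F602).
        String a = "\uD83D\uDE02 is String";
        // length() counts UTF-16 code units, so the emoji counts twice.
        System.out.println("Length: " + a.length()); // Length: 12
        System.out.println(a.charAt(0)); // high surrogate (half of the emoji)
        System.out.println(a.charAt(1)); // low surrogate (other half)
        System.out.println(a.charAt(2)); // the space
        System.out.println(a.charAt(3)); // 'i'
    }
}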

In IntelliJ IDEA I can't even paste the 4-byte character as a single character: when I paste the emoji, the IDE automatically converts it to "\uD83D\uDE02". Notice that this emoji is counted as 2 chars.

If you want to count the 'real length', you should use: System.out.println("Real length: " + a.codePointCount(0, a.length()));
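Here is a small sketch that puts the two lengths side by side for the same string and also recombines the two halves of the emoji with `Character.toCodePoint`. The class name `LengthDemo` and its two helper methods are invented for illustration.

```java
public class LengthDemo {
    // Number of UTF-16 code units: what length() reports and charAt() indexes.
    public static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points: a surrogate pair counts once.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String a = "\uD83D\uDE02 is String";
        System.out.println("Length: " + codeUnits(a));       // Length: 12
        System.out.println("Real length: " + codePoints(a)); // Real length: 11

        // Combine the high and low surrogate back into one code point.
        int cp = Character.toCodePoint(a.charAt(0), a.charAt(1));
        System.out.println(Integer.toHexString(cp));          // 1f602
    }
}
```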

Take a look at: What are the most common non-BMP Unicode characters in actual use?

Brian Roach · Answer 4 · 2013-01-04T13:20:05.733

The Javadocs explain this:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.

~~In short, the book is wrong.~~

Edit to add from comments below: Something I didn't think of last night was that the character you used in your question isn't actually the one they're talking about, and what they're really getting at is what happens when you have a character that requires four bytes rather than two. The paragraph above in the Javadoc links to another Javadoc section, Unicode Character Representations, which talks about the ramifications of this.
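One way to check which case a given character falls into is to ask the `Character` class directly. This is only a sketch: the class name `PlaneCheck` and the helper `needsSurrogatePair` are made up here, and the two sample code points are U+2124 from the question and U+1D419, a supplementary character.

```java
public class PlaneCheck {
    // True when the code point lies outside the BMP and therefore
    // takes two char values (four bytes) in a Java String.
    public static boolean needsSurrogatePair(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {0x2124, 0x1D419};
        for (int cp : samples) {
            System.out.println(Integer.toHexString(cp)
                    + ": " + Character.charCount(cp) + " code unit(s)"
                    + ", supplementary=" + needsSurrogatePair(cp));
        }
        // prints:
        // 2124: 1 code unit(s), supplementary=false
        // 1d419: 2 code unit(s), supplementary=true
    }
}
```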

edited Jan 04 '13 at 13:20

answered Jan 04 '13 at 03:12

Brian Roach

drat! 21 seconds faster :P – Jason Sperske Jan 04 '13 at 03:14
Yeah, but you added the bit about getting the codepoints ... I was planning on editing to add that :) – Brian Roach Jan 04 '13 at 03:15
As far as I can see, the book is right about the basic principle, just wrong about the character used as an example. If ℤ were replaced by U+1D419 MATHEMATICAL BOLD CAPITAL Z, the presentation would be correct (but readers might still get confused). – Jukka K. Korpela Jan 04 '13 at 09:14
So ... the question is actually the problem. Agreed if it were a character that required two code units, then things would be different. – Brian Roach Jan 04 '13 at 13:16
