What exactly does String.codePointAt do?

Question

Recently I ran into the codePointAt method of String in Java. I also found a few other codePoint methods: codePointBefore, codePointCount etc. They definitely have something to do with Unicode, but I do not understand them.

Now I wonder when and how one should use codePointAt and similar methods.

Joachim Sauer · Accepted Answer · 2019-01-17T09:38:47.250

Short answer: it gives you the Unicode codepoint that starts at the specified index in the String, i.e. the "Unicode number" of the character at that position.

Longer answer: Java was created when 16 bits (i.e. one char) were enough to hold any Unicode character that existed (that range is now known as the Basic Multilingual Plane or BMP). Later, Unicode was extended to include characters with codepoints of 2¹⁶ and above. This means that a char can no longer hold every possible Unicode codepoint.

UTF-16 was the solution: it stores the "old" Unicode codepoints in 16 bit (i.e. exactly one char) and all the new ones in 32 bit (i.e. two char values). Those two 16 bit values are called a "surrogate pair". Now strictly speaking a char holds a "UTF-16 code unit" instead of "a Unicode character" as it used to.
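
As a rough sketch of how that split works, the arithmetic below turns a codepoint above U+FFFF into its two char values and compares the result with the JDK's own `Character.toChars`. The class and method names (`SurrogatePairDemo`, `encode`) are made up for illustration.

```java
class SurrogatePairDemo {
    // Splits a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair by hand.
    static char[] encode(int codePoint) {
        int offset = codePoint - 0x10000;               // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits go into the high surrogate
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits go into the low surrogate
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600;
        char[] manual = encode(grinningFace);
        char[] builtIn = Character.toChars(grinningFace);
        System.out.printf("manual:   %04X %04X%n", (int) manual[0], (int) manual[1]);
        System.out.printf("built-in: %04X %04X%n", (int) builtIn[0], (int) builtIn[1]);
        // Combining the pair again yields the original code point.
        System.out.printf("decoded:  U+%X%n", Character.toCodePoint(manual[0], manual[1]));
    }
}
```

For U+1F600 both lines should show D83D DE00, which is why the string's length() counts that one character as two.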

All the "old" methods (handling only char) still work fine as long as you don't use any of the "new" Unicode characters (or don't really care about them). If you care about the new characters as well (or simply need complete Unicode support), you need to use the "codepoint" versions, which support all possible Unicode codepoints.
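
To illustrate the difference in practice, here is a minimal sketch of walking a String by codepoint rather than by char. The class name `CodePointIteration` and the helper `countCodePoints` are hypothetical; `codePointAt`, `Character.charCount` and `String.codePoints()` (Java 8+) are the standard JDK calls.

```java
class CodePointIteration {
    // Walks the string one code point at a time instead of one char at a time.
    static int countCodePoints(String s) {
        int count = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // advance by 1 char (BMP) or 2 chars (surrogate pair)
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(s.length());         // number of chars
        System.out.println(countCodePoints(s)); // number of code points
        // Java 8+ alternative: stream the code points directly.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

The hand-written loop does the same job as `s.codePointCount(0, s.length())`; it is spelled out only to show where `codePointAt` fits in.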

Note: A very well known example of unicode characters that are not in the BMP (i.e. work only when using the codepoint variant) are Emojis: Even the simple Grinning Face U+1F600 can't be represented in a single char.

could you provide an example where `charAt()` would fail to give the complete Code Point but where `codePointAt()` would succeed? — Zaid Khan, Apr 22 '18 at 07:17
For Zaid Khan: `String s3 = "\u0041\u00DF\u6771\uD801\uDC00"; System.out.println(s3.charAt(3)); System.out.println(s3.codePointAt(3));` — user1643352, Nov 20 '19 at 10:07
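
A runnable version of that snippet might look like the following. It prints the values in hex so the result does not depend on the console's encoding; the class name `CharAtVersusCodePointAt` and the constant `S3` are just illustrative.

```java
class CharAtVersusCodePointAt {
    // Same string as in the comment: 'A', sharp s, a CJK character, then U+10400 as a surrogate pair.
    static final String S3 = "\u0041\u00DF\u6771\uD801\uDC00";

    public static void main(String[] args) {
        // charAt(3) returns only the high surrogate, which is not a character on its own.
        System.out.printf("charAt(3)      = 0x%04X%n", (int) S3.charAt(3));
        System.out.println(Character.isHighSurrogate(S3.charAt(3)));
        // codePointAt(3) combines the chars at index 3 and 4 into the full code point.
        System.out.printf("codePointAt(3) = U+%X%n", S3.codePointAt(3));
    }
}
```

Here charAt(3) yields 0xD801 (half of a pair), while codePointAt(3) yields U+10400.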

Peter Lawrey · Answer 2 · 2012-09-05T12:08:09.630

Code points support characters above 65535 which is Character.MAX_VALUE.

If you have text with such high characters you have to work with code points or int instead of chars.

It does this by supporting UTF-16, which uses one or two 16-bit char values per character, and turning them into an int.
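
A small sketch of why an int is needed: the value below does not fit in a char, and casting it simply drops the upper bits. `IntVersusChar` and `fromCodePoint` are hypothetical names; the `Character` and `StringBuilder` methods are standard JDK calls.

```java
class IntVersusChar {
    // Builds a String from a code point held in an int; works for any valid code point.
    static String fromCodePoint(int codePoint) {
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // above Character.MAX_VALUE (0xFFFF)
        System.out.println(cp > Character.MAX_VALUE);               // true
        System.out.println(Character.isSupplementaryCodePoint(cp)); // true
        System.out.println(Character.charCount(cp));                // 2
        // Casting to char silently drops the upper bits, leaving 0xF600.
        System.out.printf("0x%04X%n", (int) (char) cp);
        // The resulting String holds two chars for this single code point.
        System.out.println(fromCodePoint(cp).length());             // 2
    }
}
```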

AFAIK, this is generally only required for Supplementary Multilingual and Supplementary Ideographic characters added recently, such as non-traditional Chinese.

answered Sep 05 '12 at 11:57

Well, not *only* non-traditional Chinese: https://en.wikipedia.org/wiki/Plane_(Unicode) Many lesser-known languages, some mathematical symbols, Emoticon and pretty much everything that is introduced into Unicode relatively recently lives outside the BMP. There's a [relevant question here](http://stackoverflow.com/questions/5567249/what-are-the-most-common-non-bmp-unicode-characters-in-actual-use). – Joachim Sauer Sep 05 '12 at 12:03

Sam YC · Answer 3 · 2021-06-17T07:36:02.313

The code example below helps to clarify the use of codePointAt:

    String myStr = "1\uD83D\uDE003"; // '1', U+1F600 (stored as a surrogate pair), '3'
    System.out.println(myStr.length()); // prints 4, because U+1F600 takes two chars
    System.out.println(myStr.codePointCount(0, myStr.length())); // prints 3, counting whole code points
    
    int result = myStr.codePointAt(0);
    System.out.println(Character.toChars(result)); // prints 1
    
    result = myStr.codePointAt(1);
    System.out.println(Character.toChars(result)); // prints the emoji, because codePointAt reads the whole surrogate pair (high and low)
    
    result = myStr.codePointAt(2);
    System.out.println(Character.toChars(result)); // index 2 is the low surrogate alone, which typically shows as "?"
    
    result = myStr.codePointAt(3);
    System.out.println(Character.toChars(result)); // prints 3
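
The question also mentions codePointBefore. As a hedged sketch, the snippet below uses a string of the same shape ('1', one supplementary character, '3') to show `codePointBefore` and `offsetByCodePoints`, which converts a position counted in codepoints into a char index. `CodePointNavigation`, `MY_STR` and `nthCodePoint` are illustrative names only.

```java
class CodePointNavigation {
    static final String MY_STR = "1\uD83D\uDE003"; // '1', U+1F600, '3'

    // Returns the code point at the given code point position (not char index).
    static int nthCodePoint(String s, int n) {
        int charIndex = s.offsetByCodePoints(0, n); // skip n whole code points from the start
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        System.out.printf("U+%X%n", nthCodePoint(MY_STR, 1));   // the supplementary character
        System.out.printf("U+%X%n", nthCodePoint(MY_STR, 2));   // '3'
        // codePointBefore(3) looks backwards from index 3 and reads the full pair at 1..2.
        System.out.printf("U+%X%n", MY_STR.codePointBefore(3));
        // codePointBefore(2) lands in the middle of the pair and returns the lone high surrogate.
        System.out.printf("0x%X%n", MY_STR.codePointBefore(2));
    }
}
```

Using `offsetByCodePoints` avoids landing on index 2, the middle of the surrogate pair.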

score 0 · Answer 4 · edited May 23 '17 at 12:10

In short: rarely, as long as you are using the default charset in Java :) But for a more detailed explanation try these posts:

Comparing a char to a code-point? http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html http://javarevisited.blogspot.com/2012/01/java-string-codepoint-get-unicode.html

Hope this helps clarify things for you :)

JTMon · answered Sep 05 '12 at 11:57

Those methods are not (directly) related to character sets (except that I know of no non-universal character sets that encode anything outside the BMP). – Joachim Sauer Sep 05 '12 at 12:05
