Is there any use case where having `codePointBefore()` would be advantageous? If you have the index, can't you already call `codePointAt(i-1)`?

Zhro
-
possible duplicate of http://stackoverflow.com/questions/12280801/what-exactly-does-the-string-codepointat-do – Rahul Winner Jan 25 '16 at 06:44
-
Not a duplicate. I'm asking about the relevancy of `codePointBefore()` as an alternative to `codePointAt()` in a use case scenario. – Zhro Jan 25 '16 at 06:48
1 Answer
A code point may consist of multiple `char`s, each of which is still only a 16-bit UTF-16 code unit. The index given to the methods in `String` is an index into its underlying array `char[] value`, not the index of a code point. The `String` methods check bounds and wrap methods of `Character`:
// Java 8 java.lang.String source code
public int codePointAt(int index) {
    if ((index < 0) || (index >= value.length)) {
        throw new StringIndexOutOfBoundsException(index);
    }
    return Character.codePointAtImpl(value, index, value.length);
}
//...
public int codePointBefore(int index) {
    int i = index - 1;
    if ((i < 0) || (i >= value.length)) {
        throw new StringIndexOutOfBoundsException(index);
    }
    return Character.codePointBeforeImpl(value, index, 0);
}
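One consequence of these bounds checks is that the two methods accept different index ranges: `codePointAt()` takes `0` to `length() - 1`, while `codePointBefore()` takes `1` to `length()`. A minimal sketch to illustrate this; the class name `CodePointBounds` and the helper `throwsOutOfBounds` are made up for this example:

```java
class CodePointBounds {
    // Hypothetical helper: reports whether a call throws IndexOutOfBoundsException
    // (StringIndexOutOfBoundsException is a subclass of it).
    static boolean throwsOutOfBounds(Runnable call) {
        try {
            call.run();
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String s = "ab";

        // codePointAt accepts 0 .. length() - 1
        System.out.println(s.codePointAt(0) == 'a');                             // true
        System.out.println(throwsOutOfBounds(() -> s.codePointAt(s.length())));  // true

        // codePointBefore accepts 1 .. length()
        System.out.println(s.codePointBefore(s.length()) == 'b');                // true
        System.out.println(throwsOutOfBounds(() -> s.codePointBefore(0)));       // true
    }
}
```

So `codePointBefore(s.length())` is a valid way to read the last code point, whereas `codePointAt(s.length())` throws.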
The corresponding methods in `Character` identify and combine multiple `char`s that belong to a single code point:
// Java 8 java.lang.Character source code
static int codePointAtImpl(char[] a, int index, int limit) {
    char c1 = a[index];
    if (isHighSurrogate(c1) && ++index < limit) {
        char c2 = a[index];
        if (isLowSurrogate(c2)) {
            return toCodePoint(c1, c2);
        }
    }
    return c1;
}
//...
static int codePointBeforeImpl(char[] a, int index, int start) {
    char c2 = a[--index];
    if (isLowSurrogate(c2) && index > start) {
        char c1 = a[--index];
        if (isHighSurrogate(c1)) {
            return toCodePoint(c1, c2);
        }
    }
    return c2;
}
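Note the fallback in both implementations: when the surrogate is not part of a valid pair, the lone surrogate value itself is returned. A small sketch of that behaviour, using the public `Character.codePointAt(char[], int)` and `Character.codePointBefore(char[], int)` overloads rather than the package-private `Impl` methods; the class name and the sample arrays are assumptions made for illustration:

```java
class SurrogateFallback {
    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane, so it needs two chars.
        char[] pair = Character.toChars(0x1F600);   // {0xD83D, 0xDE00}
        char[] loneHigh = { pair[0], 'x' };          // high surrogate with no low surrogate after it
        char[] loneLow = { 'x', pair[1] };           // low surrogate with no high surrogate before it

        // A valid pair is combined in both directions.
        System.out.println(Integer.toHexString(Character.codePointAt(pair, 0)));         // 1f600
        System.out.println(Integer.toHexString(Character.codePointBefore(pair, 2)));     // 1f600

        // An unpaired surrogate is returned as-is.
        System.out.println(Integer.toHexString(Character.codePointAt(loneHigh, 0)));     // d83d
        System.out.println(Integer.toHexString(Character.codePointBefore(loneLow, 2)));  // de00
    }
}
```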
This difference in direction is important because `index-1` is not always the start of the previous code point. So `codePointBefore()` needs to start at `index-1` and look backwards, while `codePointAt()` needs to start at `index` and look forward.
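To make this concrete, here is a minimal sketch comparing `codePointAt(i - 1)` with `codePointBefore(i)` on a string that contains a surrogate pair, followed by one possible use case: walking a string backwards one code point at a time. The class name `BackwardCodePoints`, the helper `reversedCodePoints`, and the sample string are made up for this example:

```java
import java.util.ArrayList;
import java.util.List;

class BackwardCodePoints {
    // Hypothetical helper: collects the code points of s from last to first.
    static List<Integer> reversedCodePoints(String s) {
        List<Integer> out = new ArrayList<>();
        int i = s.length();
        while (i > 0) {
            int cp = s.codePointBefore(i);
            out.add(cp);
            i -= Character.charCount(cp);   // step back by 1 or 2 chars
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";   // 'a', U+1F600 (two chars), 'b'
        int i = 3;                      // char index of 'b'

        // index - 1 lands in the middle of the pair, so only the low surrogate comes back.
        System.out.println(Integer.toHexString(s.codePointAt(i - 1)));   // de00
        // codePointBefore looks backwards and finds the whole pair.
        System.out.println(Integer.toHexString(s.codePointBefore(i)));   // 1f600

        System.out.println(reversedCodePoints(s));                       // [98, 128512, 97]
    }
}
```

With `codePointAt()` alone, the backward loop would have to check for a low surrogate and step back one more `char` itself, which is exactly what `codePointBefore()` already does.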

Linus
-
I think you're looking at a different specification. According to Oracle Javadocs the signatures are: `int codePointAt(int index)` and `int codePointBefore(int index)`: https://docs.oracle.com/javase/7/docs/api/java/lang/String.html – Zhro Jan 25 '16 at 07:19
-
No I'm citing the `Character` method `String` wraps. If you give me a minute I can post the String code as well. – Linus Jan 25 '16 at 07:20
-
So `codePointAt()` considers a surrogate codepoint separate from its parent? I know that `codePointCount()` considers a composite pair to be a single codepoint. But since the implementation is acting on a char array, then the index must be working off of String.length() not String.codePointCount(), as I had assumed. Can you confirm? – Zhro Jan 25 '16 at 08:01
-
I'm trying to get a grip on what does what when it comes to handling Unicode and codepoints in Java. Here is where I'm coming from: http://stackoverflow.com/questions/34984271/why-is-javas-definition-of-char-codepoint-and-string-codepoint-length-contrad/34984359 – Zhro Jan 25 '16 at 08:07
-
Yes. String was built on char given a fixed length of 16-bits, so index refers to an index of char[] in String whether it's a high surrogate or not. – Linus Jan 25 '16 at 08:12
-
If that's the case then, yes, this function does make sense. Thank you for providing the implementation in your proof. – Zhro Jan 25 '16 at 08:23
-
No problem. Java uses UTF-16 by default, so not all code points are surrogate pairs, only those outside the Basic Multilingual Plane, which is probably why it doesn't index String differently. Backwards compatibility is probably why they don't change the width of char. Of course, I'm just speculating. – Linus Jan 25 '16 at 08:32