3

Is there any use case where having codePointBefore() would be advantageous? If you have the index, can't you already call codePointAt(i-1)?

Zhro
  • 1
    Possible duplicate of http://stackoverflow.com/questions/12280801/what-exactly-does-the-string-codepointat-do – Rahul Winner Jan 25 '16 at 06:44
  • 2
    Not a duplicate. I'm asking about the relevancy of `codePointBefore()` as an alternative to `codePointAt()` in a use case scenario. – Zhro Jan 25 '16 at 06:48

1 Answer

4

A code point may be encoded as two chars (a surrogate pair), each of which is still only a 16-bit UTF-16 code unit. The index given to these methods in String is an index into its underlying char[] value array, not the index of a code point. The String methods check bounds and then wrap methods of Character:

//Java 8 java.lang.String source code
public int codePointAt(int index) {
    if ((index < 0) || (index >= value.length)) {
        throw new StringIndexOutOfBoundsException(index);
    }
    return Character.codePointAtImpl(value, index, value.length);
}
//...
public int codePointBefore(int index) {
    int i = index - 1;
    if ((i < 0) || (i >= value.length)) {
        throw new StringIndexOutOfBoundsException(index);
    }
    return Character.codePointBeforeImpl(value, index, 0);
}

The corresponding methods in Character identify and combine the two chars of a surrogate pair that belong to a single code point:

//Java 8 java.lang.Character source code
static int codePointAtImpl(char[] a, int index, int limit) {
    char c1 = a[index];
    if (isHighSurrogate(c1) && ++index < limit) {
        char c2 = a[index];
        if (isLowSurrogate(c2)) {
            return toCodePoint(c1, c2);
        }
    }
    return c1;
}
//...
static int codePointBeforeImpl(char[] a, int index, int start) {
    char c2 = a[--index];
    if (isLowSurrogate(c2) && index > start) {
        char c1 = a[--index];
        if (isHighSurrogate(c1)) {
            return toCodePoint(c1, c2);
        }
    }
    return c2;
}

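The difference is important because index-1 is not always the start of the previous code point. codePointBefore() therefore starts at index-1 and looks backwards, while codePointAt() starts at index and looks forward. If the previous code point is a surrogate pair, codePointAt(index-1) returns only its low surrogate, whereas codePointBefore(index) returns the whole code point.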
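To make that concrete, here is a minimal sketch of a use case where codePointBefore() helps: walking a String backwards one code point at a time. The class name CodePointBeforeDemo and the helper codePointsReversed are made up for this illustration; only String.codePointAt, String.codePointBefore and Character.charCount come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointBeforeDemo {

    // Hypothetical helper: collects the code points of s from last to first.
    public static List<Integer> codePointsReversed(String s) {
        List<Integer> result = new ArrayList<>();
        int i = s.length();                 // char index just past the end
        while (i > 0) {
            int cp = s.codePointBefore(i);  // looks backwards from i
            result.add(cp);
            i -= Character.charCount(cp);   // step back by 1 or 2 chars
        }
        return result;
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600, which UTF-16 stores as the pair D83D DE00.
        String s = "a\uD83D\uDE00";
        int end = s.length();               // 3 chars, but only 2 code points

        // codePointAt(end - 1) lands on the low surrogate and cannot look back.
        System.out.printf("U+%04X%n", s.codePointAt(end - 1));   // U+DE00
        // codePointBefore(end) finds the preceding high surrogate and combines them.
        System.out.printf("U+%04X%n", s.codePointBefore(end));   // U+1F600

        System.out.println(codePointsReversed(s));               // [128512, 97]
    }
}
```

Doing the same walk with codePointAt() would mean checking by hand whether the char at i-1 is a low surrogate with a high surrogate before it, which is the check codePointBeforeImpl already performs.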

Linus
  • I think you're looking at a different specification. According to Oracle Javadocs the signatures are: `int codePointAt(int index)` and `int codePointBefore(int index)`: https://docs.oracle.com/javase/7/docs/api/java/lang/String.html – Zhro Jan 25 '16 at 07:19
  • No I'm citing the `Character` method `String` wraps. If you give me a minute I can post the String code as well. – Linus Jan 25 '16 at 07:20
  • So `codePointAt()` considers a surrogate codepoint separate from its parent? I know that `codePointCount()` considers a composite pair to be a single codepoint. But since the implementation is acting on a char array, then the index must be working off of String.length() not String.codePointCount(), as I had assumed. Can you confirm? – Zhro Jan 25 '16 at 08:01
  • I'm trying to get a grip on what does what when it comes to handling Unicode and codepoints in Java. Here is where I'm coming from: http://stackoverflow.com/questions/34984271/why-is-javas-definition-of-char-codepoint-and-string-codepoint-length-contrad/34984359 – Zhro Jan 25 '16 at 08:07
  • Yes. String was built on char given a fixed length of 16-bits, so index refers to an index of char[] in String whether it's a high surrogate or not. – Linus Jan 25 '16 at 08:12
  • If that's the case then, yes, this function does make sense. Thank you for providing the implementation in your proof. – Zhro Jan 25 '16 at 08:23
  • No problem. Java uses UTF-16 by default, so not all code points are surrogate pairs, just those outside the Basic Multilingual Plane, which is probably why it doesn't index String differently. Backwards compatibility is probably why they don't change the width of char. Of course I'm just speculating. – Linus Jan 25 '16 at 08:32