3

java.lang.StringBuilder's appendCodePoint(...) method, to me, behaves in an unexpected manner.

For Unicode code points above Character.MAX_VALUE (which need 4 bytes to encode in UTF-8, which is my Eclipse workspace setting), it behaves strangely.

I append a String's Unicode code points one by one to a StringBuilder, but its output looks different in the end. I suspect that a call to Character.toSurrogates(codePoint, value, count) in AbstractStringBuilder#appendCodePoint(...) causes this, but I don't know how to work around it.

My code:

    // returns random string in range of unicode code points 0x2F800 to 0x2FA1F
    // e.g. 
    String s = getRandomChineseJapaneseKoreanStringCompatibilitySupplementOfMaxLength(length);
    System.out.println(s);

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < getCodePointCount(s); i++) {
        sb.appendCodePoint(s.codePointAt(i));
    }
    // prints some of the CJK characters, but between them there is a '?'

    // e.g. ???????????????
    System.out.println(sb.toString());

    // returns random string in range of unicode code points 0x20000 to 0x2A6DF
    // e.g. 
    s = getRandomChineseJapaneseKoreanStringExtensionBOfMaxLength(length);
    // prints the CJK characters correctly
    System.out.println(s);

    sb = new StringBuilder();
    for (int i = 0; i < getCodePointCount(s); i++) {
        sb.appendCodePoint(s.codePointAt(i));
    }

    // prints some of the CJK characters, but between them there is a '?'
    // e.g. ???????????????
    System.out.println(sb.toString());

With:

public static int getCodePointCount(String s) {
    return s.codePointCount(0, s.length());
}

public static String getRandomChineseJapaneseKoreanStringExtensionBOfMaxLength(int length) {
    return getRandomStringOfMaxLengthInRange(length, 0x20000, 0x2A6DF);
}

public static String getRandomChineseJapaneseKoreanStringCompatibilitySupplementOfMaxLength(int length) {
    return getRandomStringOfMaxLengthInRange(length, 0x2F800, 0x2FA1F);
}

private static String getRandomStringOfMaxLengthInRange(int length, int from, int to) {

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < length; i++) {

        // try to find a valid character MAX_TRIES times
        for (int j = 0; j < MAX_TRIES; j++) {

            int unicodeInt = from + random.nextInt(to - from);

            if (Character.isValidCodePoint(unicodeInt) &&
                    (Character.isLetter(unicodeInt) || Character.isDigit(unicodeInt) ||
                    Character.isWhitespace(unicodeInt))) {
                sb.appendCodePoint(unicodeInt);
                break;
            }

        }

    }

    return  new String(sb.toString().getBytes(), "UTF-8");
}
RobertG
  • Added some details of the convenience methods I use, though these may not be relevant. Interesting, though, that those originally use a StringBuilder as well, while new String(sb.toString().getBytes(), "UTF-8"); did not work for me. – RobertG Jan 13 '15 at 16:40
  • What I kind of did not understand back then was that UTF-16 is not UCS-2. This one helped: http://stackoverflow.com/a/12280911/1143126 Strings are always UTF-16 internally; as others correctly stated, the encoding only becomes an issue when they are read from or converted to bytes (e.g. with getBytes()). – RobertG Mar 04 '16 at 12:46

2 Answers

3

You're iterating over the code points incorrectly. You should use the strategy presented by Jonathan Feinberg here

final int length = s.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}

or, since Java 8:

s.codePoints().forEach(/* do something */);
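For the copy-into-a-StringBuilder case from the question, the stream variant could look roughly like the sketch below. The class name CodePointStreamCopy and the method copy are made up for illustration; the library calls (String#codePoints, IntStream#collect, StringBuilder#appendCodePoint) are standard Java 8.

```java
class CodePointStreamCopy {

    /** Rebuilds s code point by code point using the Java 8 stream API. */
    public static String copy(String s) {
        return s.codePoints()
                .collect(StringBuilder::new,              // fresh builder
                         StringBuilder::appendCodePoint,  // add one code point
                         StringBuilder::append)           // merge partial builders
                .toString();
    }

    public static void main(String[] args) {
        // U+2F8EB followed by U+20000, both outside the BMP
        String s = "\ud87e\udceb\ud840\udc00";
        System.out.println(copy(s).equals(s)); // true
    }
}
```

codePoints() already walks the string pair by pair, so no index arithmetic is needed here.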

Note the Javadoc of String#codePointAt(int)

Returns the character (Unicode code point) at the specified index. The index refers to char values (Unicode code units) and ranges from 0 to length()- 1.

You were iterating from 0 to codePointCount and incrementing the index by 1 each time, but the index passed to codePointAt counts chars, not code points. If the char at that index is not the start of a high-low surrogate pair, it is returned alone, and your index should only increase by 1. Otherwise, you get the code point corresponding to the pair, and the index should increase by 2. Character#charCount(int) gives you the right increment in both cases.
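Put together for the StringBuilder case in the question, a minimal sketch might look like this. The class name CodePointCopy and the method copyByCodePoint are hypothetical helpers, not anything from the question's code.

```java
class CodePointCopy {

    /** Copies s code point by code point, advancing the char index by 1 or 2 as needed. */
    public static String copyByCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        final int length = s.length(); // length in chars, not code points
        for (int offset = 0; offset < length; ) {
            final int codePoint = s.codePointAt(offset);
            sb.appendCodePoint(codePoint);
            offset += Character.charCount(codePoint); // 2 for a surrogate pair, else 1
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+2F8EB, the letter 'a', then U+20000
        String s = new StringBuilder()
                .appendCodePoint(0x2F8EB)
                .append('a')
                .appendCodePoint(0x20000)
                .toString();
        System.out.println(s.length());                      // 5
        System.out.println(s.codePointCount(0, s.length())); // 3
        System.out.println(copyByCodePoint(s).equals(s));    // true
    }
}
```

The two printed numbers show why the loop bound matters: the string has 5 chars but only 3 code points, and codePointAt expects a char index.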

Sotirios Delimanolis
  • *sigh* Yeah, I can't use Java 8 syntax, unfortunately :[ Thank you very much, though I prefer a for-loop style, charCount(int) seems to be more efficient than offsetByCodePoints(...) suggested in another answer. – RobertG Jan 14 '15 at 09:09
1

Change your loops from this:

for (int i = 0; i < getCodePointCount(s); i++) {

to this:

for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {

In Java, a char is a single UTF-16 code unit. Supplementary code points (those above 0xFFFF) take up two chars in a String.

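But your loop index is a char index, and you advance it one char at a time. This means that you are reading each supplementary code point you reach twice: the first time, the index is on the high surrogate and you read both of its UTF-16 surrogate chars; the second time, the index is on the low surrogate and you read and append just that lone char. And because your loop stops at the code point count rather than at s.length(), it never reaches the later chars of the string at all.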

Consider a string which contains only one codepoint, 0x2f8eb. A Java String representing that codepoint would actually contain this:

"\ud87e\udceb"

If you loop through each individual char index, then your loop would effectively do this:

sb.appendCodePoint(0x2f8eb);    // codepoint found at index 0
sb.appendCodePoint(0xdceb);     // codepoint found at index 1
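To see the effect with the loop exactly as written in the question (bounded by the code point count), you need at least two supplementary code points. Here is a small sketch that reproduces it; SurrogateDemo, brokenCopy and hexChars are names invented for this example.

```java
class SurrogateDemo {

    /** Reproduces the question's loop: i is a char index, but the bound is the code point count. */
    public static String brokenCopy(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.codePointCount(0, s.length()); i++) {
            sb.appendCodePoint(s.codePointAt(i));
        }
        return sb.toString();
    }

    /** Formats each char (UTF-16 code unit) of s as four hex digits. */
    public static String hexChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two supplementary code points: U+2F8EB and U+20000
        String s = new String(Character.toChars(0x2f8eb))
                 + new String(Character.toChars(0x20000));
        System.out.println(hexChars(s));             // d87e dceb d840 dc00
        System.out.println(hexChars(brokenCopy(s))); // d87e dceb dceb
    }
}
```

The copy contains the first code point, then its low surrogate a second time, and the second code point is missing entirely. A lone low surrogate has no valid encoding, which is most likely where the '?' in your output comes from.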
VGR