java.lang.StringBuilder's appendCodePoint(...) method, to me, behaves in an unexpected manner.
For unicode code points above Character.MAX_VALUE (which will need 3 or 4 bytes to encode in UTF-8, which is my Eclipse workspace setting), it behaves strangely.
I append a String's Unicode code points one by one to a StringBuilder, but its output looks different in the end. I suspect that a call to Character.toSurrogates(codePoint, value, count) in AbstractStringBuilder#appendCodePoint(...) causes this, but I don't know how to work around it.
My code:
// returns random string in range of unicode code points 0x2F800 to 0x2FA1F
// e.g.
String s = getRandomChineseJapaneseKoreanStringCompatibilitySupplementOfMaxLength(length);
System.out.println(s);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < getCodePointCount(s); i++) {
sb.appendCodePoint(s.codePointAt(i));
}
// prints some of the CJK characters, but between them there is a '?'
// e.g. ???????????????
System.out.println(sb.toString());
// returns random string in range of unicode code points 0x20000 to 0x2A6DF
// e.g.
s = getRandomChineseJapaneseKoreanStringExtensionBOfMaxLength(length);
// prints the CJK characters correctly
System.out.println(s);
sb = new StringBuilder();
for (int i = 0; i < getCodePointCount(s); i++) {
sb.appendCodePoint(s.codePointAt(i));
}
// prints some of the CJK characters, but between them there is a '?'
// e.g. ???????????????
System.out.println(sb.toString());
With:
public static int getCodePointCount(String s) {
return s.codePointCount(0, s.length());
}
public static String getRandomChineseJapaneseKoreanStringExtensionBOfMaxLength(int length) {
return getRandomStringOfMaxLengthInRange(length, 0x20000, 0x2A6DF);
}
public static String getRandomChineseJapaneseKoreanStringCompatibilitySupplementOfMaxLength(int length) {
return getRandomStringOfMaxLengthInRange(length, 0x2F800, 0x2FA1F);
}
private static String getRandomStringOfMaxLengthInRange(int length, int from, int to) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < length; i++) {
// try to find a valid character MAX_TRIES times
for (int j = 0; j < MAX_TRIES; j++) {
int unicodeInt = from + random.nextInt(to - from);
if (Character.isValidCodePoint(unicodeInt) &&
(Character.isLetter(unicodeInt) || Character.isDigit(unicodeInt) ||
Character.isWhitespace(unicodeInt))) {
sb.appendCodePoint(unicodeInt);
break;
}
}
}
return new String(sb.toString().getBytes(), "UTF-8");
}