I have a small demo app showing the issues with Java's substring implementation when using Unicode code points that require surrogate pairs (i.e. code points outside the Basic Multilingual Plane, which cannot be represented in a single 16-bit char). I'm wondering if my solution works well or if I'm missing anything. I've considered posting on Code Review, but this has much more to do with Java's implementation of Strings than with my simple code itself.
public class SubstringTest {

    public static void main(String[] args) {
        // U+1D54A (𝕊) lies outside the BMP, so each copy takes a surrogate pair (two chars)
        String stringWithPlus2ByteCodePoints = "\uD835\uDD4A\uD835\uDD4A\uD835\uDD4A";

        String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
        String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
        String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);

        System.out.println(stringWithPlus2ByteCodePoints);
        System.out.println("invalid sub: " + substring1);
        System.out.println("invalid sub: " + substring2);
        System.out.println("invalid sub: " + substring3);

        String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
        String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
        String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);

        System.out.println("real sub: " + realSub1);
        System.out.println("real sub: " + realSub2);
        System.out.println("real sub: " + realSub3);
    }

    private static String getRealSubstring(String string, int beginIndex, int endIndex) {
        if (string == null)
            throw new IllegalArgumentException("String should not be null");

        // Validate against the number of code points, not the number of chars
        int length = string.codePointCount(0, string.length());
        if (beginIndex < 0 || endIndex < 0 || beginIndex > endIndex || endIndex > length)
            throw new IllegalArgumentException("Invalid indices");

        // Translate code point indices into char indices before slicing
        int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
        int realEndIndex = string.offsetByCodePoints(0, endIndex);
        return string.substring(realBeginIndex, realEndIndex);
    }
}
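For comparison, here is an alternative I sketched using the Java 8+ codePoints() stream (the class and method names here are my own, purely for illustration). One difference from the offsetByCodePoints approach: skip and limit silently truncate when the indices run past the end of the string instead of throwing an exception, so the bounds checking would have to stay separate.

public class CodePointSubstring {

    // Sketch: take a substring by code point indices via a codePoints() stream
    static String codePointSubstring(String s, int beginIndex, int endIndex) {
        return s.codePoints()                      // one int per code point; surrogate pairs already joined
                .skip(beginIndex)                  // drop the first beginIndex code points
                .limit(endIndex - beginIndex)      // keep endIndex - beginIndex code points
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        String s = "\uD835\uDD4A\uD835\uDD4A\uD835\uDD4A"; // 𝕊𝕊𝕊
        System.out.println(codePointSubstring(s, 1, 3));   // prints 𝕊𝕊, never splits a pair
    }
}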
The output:
𝕊𝕊𝕊
invalid sub: ?
invalid sub: 𝕊
invalid sub: ??
real sub: 𝕊
real sub: 𝕊𝕊
real sub: 𝕊𝕊
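The ? characters appear to be the console's rendering of the unpaired surrogates; a quick check (my own sketch, not part of the demo) confirms that what substring(0, 1) leaves behind is a lone high surrogate:

char c = "\uD835\uDD4A".substring(0, 1).charAt(0); // first char of 𝕊, i.e. half the pair
System.out.println(Character.isHighSurrogate(c));  // true: an unpaired high surrogate
System.out.println((int) c);                       // 55349 (0xD835)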
Can I rely on my substring implementation to always give the desired substring, avoiding the issues Java's char-based substring method has with surrogate pairs?