Java substring by code point indices (treating pairs of surrogate code units as single code point)

Question

I have a small demo app showing the issues with Java's `substring` implementation when a string contains Unicode code points that require surrogate pairs (i.e. cannot be represented in a single 16-bit `char`). I'm wondering whether my solution works well or whether I'm missing anything. I considered posting on Code Review, but this has much more to do with Java's implementation of `String` than with my simple code itself.

public class SubstringTest {
    public static void main(String[] args) {

        String stringWithPlus2ByteCodePoints = "";

        String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
        String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
        String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);

        System.out.println(stringWithPlus2ByteCodePoints);
        System.out.println("invalid sub: " + substring1);
        System.out.println("invalid sub: " + substring2);
        System.out.println("invalid sub: " + substring3);

        String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
        String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
        String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
        System.out.println("real sub: " + realSub1);
        System.out.println("real sub: " + realSub2);
        System.out.println("real sub: " + realSub3);
    }

    private static String getRealSubstring(String string, int beginIndex, int endIndex) {
        if (string == null)
            throw new IllegalArgumentException("String should not be null");
        int length = string.length();
        if (endIndex < 0 || beginIndex > endIndex || beginIndex > length || endIndex > length)
            throw new IllegalArgumentException("Invalid indices");
        int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
        int realEndIndex = string.offsetByCodePoints(0, endIndex);
        return string.substring(realBeginIndex, realEndIndex);
    }

}

The output:


invalid sub: ?
invalid sub: 
invalid sub: ??
real sub: 
real sub: 
real sub:

Can I rely on my substring implementation to always give the desired substring that avoids Java's issues with using chars for its substring method?
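
A minimal reproduction of the problem, as a sketch. The sample string is an assumption (the letter `a`, then U+1F600 written as the escaped surrogate pair `\uD83D\uDE00`, then `b`), and the class name `SurrogateSplitDemo` and the helper `splitsSurrogatePair` are made up for illustration:

```java
public class SurrogateSplitDemo {

    // Hypothetical helper: true if a code unit index falls between the two
    // halves of a surrogate pair, i.e. cutting there yields broken text.
    static boolean splitsSurrogatePair(String s, int index) {
        return index > 0 && index < s.length()
                && Character.isHighSurrogate(s.charAt(index - 1))
                && Character.isLowSurrogate(s.charAt(index));
    }

    public static void main(String[] args) {
        // Assumed sample: 'a', U+1F600 (needs a surrogate pair), 'b'.
        String sample = "a\uD83D\uDE00b";

        // length() counts UTF-16 code units; codePointCount counts code points.
        System.out.println(sample.length());                           // 4
        System.out.println(sample.codePointCount(0, sample.length())); // 3

        // substring works on code units, so index 2 cuts the pair in half.
        System.out.println(splitsSurrogatePair(sample, 2));            // true

        // offsetByCodePoints converts a code point index to a code unit index.
        int end = sample.offsetByCodePoints(0, 2);
        System.out.println(end);                                       // 3
        System.out.println(splitsSurrogatePair(sample, end));          // false
    }
}
```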

It is not _"unicode characters larger than 2 bytes"_, it is Unicode code points that require surrogate pairs. Every `char` in Java is a UTF-16 code unit (so by definition is 2 bytes, but that is not really relevant), but some code points can't be represented in UTF-16 without using surrogate pairs. — Mark Rotteveel, Apr 13 '19 at 07:55
Fair enough, I wasn't sure about the terminology. Edited the title/description — Sebastiaan van den Broek, Apr 13 '19 at 07:56
You are computing offsets for code points, but your input is tested as if `beginIndex` and `endIndex` were indexing code units in `beginIndex > length || endIndex > length`. You probably want something with `codePointCount`. — Andrey Tyukin, Apr 14 '19 at 10:16
Codepoints are a thing but are you sure you don't want [grapheme clusters](https://stackoverflow.com/questions/40878804/how-to-count-grapheme-clusters-or-perceived-emoji-characters-in-java)? — Tom Blodget, Apr 16 '19 at 16:34
@TomBlodget wow this is just a whole can of worms isn’t it? I personally was just experimenting in this case, but that does look like something that might be needed in some cases. — Sebastiaan van den Broek, Apr 16 '19 at 16:46
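
To illustrate the grapheme-cluster point from the comments above: a rough sketch that indexes by the boundaries reported by `java.text.BreakIterator.getCharacterInstance()` instead of by code points. The class and method names are hypothetical, and how closely the iterator's boundaries match Unicode extended grapheme clusters (emoji sequences in particular) depends on the JDK version, so treat this as an approximation.

```java
import java.text.BreakIterator;

public class GraphemeSubstring {

    // Hypothetical helper: start and end count user-perceived characters
    // as delimited by BreakIterator, not chars or code points.
    static String graphemeSubstring(String s, int start, int end) {
        if (start < 0 || end < start) {
            throw new IndexOutOfBoundsException("start=" + start + ", end=" + end);
        }
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int begin = advance(it, it.first(), start);
        int stop = advance(it, begin, end - start);
        return s.substring(begin, stop);
    }

    // Moves the iterator forward by `count` boundaries from its current position.
    private static int advance(BreakIterator it, int current, int count) {
        int pos = current;
        for (int i = 0; i < count; i++) {
            pos = it.next();
            if (pos == BreakIterator.DONE) {
                throw new IndexOutOfBoundsException("fewer characters than requested");
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'x':
        // three chars and three code points, but two perceived characters.
        String s = "e\u0301x";
        System.out.println(s.codePointCount(0, s.length())); // 3
        System.out.println(graphemeSubstring(s, 0, 1).length());
    }
}
```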

score 2 · Accepted Answer · answered Apr 14 '19 at 11:06

No need to walk to the beginIndex twice:

    public String codePointSubstring(String s, int start, int end) {
        int a = s.offsetByCodePoints(0, start);
        return s.substring(a, s.offsetByCodePoints(a, end - start));
    }

Translated from this Scala snippet:

def codePointSubstring(s: String, begin: Int, end: Int): String = {
  val a = s.offsetByCodePoints(0, begin)
  s.substring(a, s.offsetByCodePoints(a, end - begin))
}

I omitted the IllegalArgumentExceptions, because they don't seem to contain any more information than the exceptions that would be thrown anyway.
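
For comparison, a sketch of what explicit argument checks could look like if they are wanted after all. Without them, an out-of-range index surfaces as an `IndexOutOfBoundsException` from `offsetByCodePoints` or `substring`, and a null string as a `NullPointerException`. The check below compares against `codePointCount` rather than `length()`, as suggested in the comments on the question; the class name, the `checkedCodePointSubstring` method, and the sample string are made up for illustration.

```java
public class CodePointSubstrings {

    // The method from this answer, unchanged apart from being static.
    static String codePointSubstring(String s, int start, int end) {
        int a = s.offsetByCodePoints(0, start);
        return s.substring(a, s.offsetByCodePoints(a, end - start));
    }

    // Hypothetical variant with explicit checks expressed in code points.
    static String checkedCodePointSubstring(String s, int start, int end) {
        if (s == null) {
            throw new IllegalArgumentException("String should not be null");
        }
        int count = s.codePointCount(0, s.length());
        if (start < 0 || start > end || end > count) {
            throw new IllegalArgumentException(
                    "Invalid indices " + start + ".." + end + " for " + count + " code points");
        }
        return codePointSubstring(s, start, end);
    }

    public static void main(String[] args) {
        // Assumed sample: 'a', U+1F600, 'b' (4 chars, 3 code points).
        String sample = "a\uD83D\uDE00b";
        System.out.println(checkedCodePointSubstring(sample, 1, 3).length()); // 3
        try {
            // 4 would pass a length()-based check but exceeds the code point count.
            checkedCodePointSubstring(sample, 0, 4);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The `codePointCount` call adds one more pass over the string, which is the price of the more descriptive error.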
