For a Java program I'm writing, I have a particular need to sort strings lexicographically by Unicode code point. This is not the same as String.compareTo() when you start dealing with values outside the Basic Multilingual Plane. String.compareTo() compares strings lexicographically on 16-bit char values. To see that this is not equivalent, note that U+FD00 ARABIC LIGATURE HAH WITH YEH ISOLATED FORM is less than U+1D11E MUSICAL SYMBOL G CLEF, but the Java String object "\uFD00" for the Arabic character compares greater than the surrogate pair "\uD834\uDD1E" for the clef.
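
To make the mismatch concrete, here is a small sketch that reproduces it with the two strings above. The class name CompareToDemo and the constant names are made up for illustration; only String.compareTo() and String.codePointAt() are real API calls.

```java
public class CompareToDemo {
    // The two example strings from the question (names are illustrative only).
    public static final String ARABIC = "\uFD00";      // U+FD00, a single char
    public static final String CLEF = "\uD834\uDD1E";  // U+1D11E, a surrogate pair

    public static void main(String[] args) {
        // By code point, U+FD00 sorts before U+1D11E.
        System.out.println(ARABIC.codePointAt(0) < CLEF.codePointAt(0)); // prints "true"

        // By 16-bit char value, 0xFD00 sorts after the high surrogate 0xD834,
        // so compareTo() reports the Arabic string as the greater one.
        System.out.println(ARABIC.compareTo(CLEF) > 0); // prints "true"
    }
}
```
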
I can manually loop along the code points using String.codePointAt() and Character.charCount() and do the comparison myself if necessary. Is there an API function or other more "canonical" way of doing this?
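
For reference, the manual loop I have in mind looks roughly like the sketch below. The class name CodePointOrder and its compare method are my own hypothetical names, not an existing API; the only library calls used are String.codePointAt(), Character.charCount(), and Integer.compare(). It assumes that an unpaired surrogate should simply be ordered by its own value, which is what codePointAt() returns in that case.

```java
import java.util.Comparator;

public final class CodePointOrder {
    private CodePointOrder() {}

    // Compares two strings lexicographically by Unicode code point.
    // Hypothetical helper for illustration, not a standard library method.
    public static int compare(String a, String b) {
        int i = 0;
        // While the prefixes are equal, both strings have consumed the same
        // number of chars, so a single index is enough.
        while (i < a.length() && i < b.length()) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(i);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            // Advance by 1 char for BMP code points, 2 for surrogate pairs.
            i += Character.charCount(cpA);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length(), b.length());
    }

    // Convenience comparator, e.g. for List.sort or Collections.sort.
    public static final Comparator<String> COMPARATOR = CodePointOrder::compare;
}
```

With this, something like list.sort(CodePointOrder.COMPARATOR) would order "\uFD00" before "\uD834\uDD1E", which is the opposite of what String.compareTo() gives.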