What is the preferred way to compare two Java Strings lexicographically on Unicode code points?

Question

For a Java program I'm writing, I have a particular need to sort strings lexicographically by Unicode code point. This is not the same as String.compareTo() when you start dealing with values outside the Basic Multilingual Plane. String.compareTo() compares strings lexicographically on 16-bit char values. To see that this is not equivalent, note that U+FD00 ARABIC LIGATURE HAH WITH YEH ISOLATED FORM is less than U+1D11E MUSICAL SYMBOL G CLEF, but the Java String object "\uFD00" for the Arabic character compares greater than the surrogate pair "\uD834\uDD1E" for the clef.
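
To make the mismatch concrete, here is a small sketch using the two characters above. The class and constant names are hypothetical and chosen only for illustration.

```java
public class CodeUnitVsCodePoint {
    // U+FD00 fits in a single 16-bit char.
    static final String HAH = "\uFD00";
    // U+1D11E lies outside the BMP, so it is stored as a surrogate pair.
    static final String CLEF = "\uD834\uDD1E";

    public static void main(String[] args) {
        // compareTo looks at char values: 0xFD00 is greater than 0xD834.
        System.out.println(HAH.compareTo(CLEF) > 0);                   // true
        // Comparing the code points gives the opposite order.
        System.out.println(HAH.codePointAt(0) < CLEF.codePointAt(0));  // true
    }
}
```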

I can manually loop along the code points using String.codePointAt() and Character.charCount() and do the comparison myself if necessary. Is there an API function or other more "canonical" way of doing this?
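
For reference, the manual loop I have in mind looks roughly like the sketch below. `CodePointOrder` is a hypothetical helper name, not a JDK class. It assumes that unpaired surrogates, if present, should simply be compared by their own value, which is what `codePointAt` returns for them.

```java
import java.util.Comparator;

public final class CodePointOrder {
    // Comparator view, usable with sort(), TreeMap, etc.
    public static final Comparator<String> COMPARATOR = CodePointOrder::compare;

    private CodePointOrder() {}

    public static int compare(String a, String b) {
        int i = 0;
        // Up to index i both strings are identical, so one index
        // is enough to walk both of them.
        while (i < a.length() && i < b.length()) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(i);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            // Equal code points occupy the same number of chars.
            i += Character.charCount(cpA);
        }
        // One string is a prefix of the other: the shorter sorts first.
        return Integer.compare(a.length(), b.length());
    }
}
```

With this, `CodePointOrder.compare("\uFD00", "\uD834\uDD1E")` is negative, whereas `"\uFD00".compareTo("\uD834\uDD1E")` is positive.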

Do you definitely need it to be lexicographic, without any regard for normalization, locale etc? — Jon Skeet, Dec 09 '14 at 17:45
@JonSkeet The actual problem I'm trying to solve is that I have a quirky case in a file format I'm designing where I need a String ordering that 1) works for any Unicode character, 2) is locale-independent, and 3) is easy to specify so that other programs can replicate it. The actual ordering is somewhat less relevant. I picked Unicode code point order as it seemed the most straightforward to specify given the above constraints. Incidentally, the input strings will in fact be normalized to NFC due to other rules in the spec. — Aaron Rotenberg, Dec 09 '14 at 17:52
What languages are the other programs likely to be written in? If they're ones where UTF-16 is the norm (e.g. anything in .NET) then you could easily just say that you're comparing the UTF-16 code units lexically... — Jon Skeet, Dec 09 '14 at 17:53
@JonSkeet Don't know. The format is intended to be an open standard that anyone could generate with any language they choose. I've considered the option of specifying UTF-16 code unit order but I don't like it very much since a number of more recent languages don't use UTF-16 encoding natively. I've also worked with my team to try to come up with a way of avoiding having to specify an ordering in the format spec, but everything we've come up with for the situation in question causes more problems than it solves. — Aaron Rotenberg, Dec 09 '14 at 17:58
Fair enough. I don't know of a better way than `codePointAt` and iterating manually, basically... sorry not to have been able to give you a more canonical approach, but it sounds like you're already along the right lines. — Jon Skeet, Dec 09 '14 at 18:00
You need to look at [`java.text.Collator`](https://docs.oracle.com/javase/tutorial/i18n/text/locale.html). — user207421, Aug 14 '15 at 09:45
@EJP What Collator instance would give the described behavior? Remember, I'm not looking for locale-appropriate ordering, I'm looking for a very specific ordering that is defined in a locale- and platform-independent way. — Aaron Rotenberg, Aug 28 '15 at 16:51

score 1 · Answer 1 · answered Sep 02 '15 at 07:27

It's called collation. See https://docs.oracle.com/javase/tutorial/i18n/text/locale.html

Note that your database can sort your query results using collations too. See, for example, what MySQL supports: https://dev.mysql.com/doc/refman/5.0/en/charset-charsets.html
