
TL;DR

Java may use two chars (a surrogate pair) to represent a single character. Using Arrays.sort (unstable sort) messes with the character sequencing. Should I convert char[] to int[], or is there a better way?

Details

Java represents characters in UTF-16, and the Character class wraps a single char (16 bits). A code point outside the Basic Multilingual Plane (e.g., an emoji) therefore takes two chars (32 bits), a surrogate pair.
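
For illustration (a minimal sketch, not part of the original question; the class name SurrogateDemo is made up), a single code point outside the Basic Multilingual Plane occupies two chars in a String even though it is one character:

public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1F601 lies outside the Basic Multilingual Plane, so it is stored
        // as a surrogate pair: two char values for a single code point.
        String emoji = new String(Character.toChars(0x1F601));

        System.out.println(emoji.length());                             // 2 (chars)
        System.out.println(emoji.codePointCount(0, emoji.length()));    // 1 (code point)
        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(emoji.charAt(1)));  // true
    }
}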

Sorting a string of such UTF-16 encoded characters using the built-in sort messes up the data. (Arrays.sort uses a dual-pivot quicksort, and Collections.sort uses Arrays.sort to do the heavy lifting.)

To be specific, do you convert char[] to int[] or is there a better way to sort?

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        int[] utfCodes = {128513, 128531, 128557};
        String emojis = new String(utfCodes, 0, 3);
        System.out.println("Initial String: " + emojis);

        char[] chars = emojis.toCharArray();
        Arrays.sort(chars); // sorts individual UTF-16 code units, which can split surrogate pairs
        System.out.println("Sorted String: " + new String(chars));
    }
}

Output:

Initial String: 😁😓😭
Sorted String: ????
dingy
  • This is what we call a "Collation". You should use a library for this because there are many collations to choose from. – Guillaume F. Apr 23 '19 at 02:33
  • I don't think that 'unstable sort' is a right word to use here: https://stackoverflow.com/questions/1517793/what-is-stability-in-sorting-algorithms-and-why-is-it-important – Artur Biesiadowski Apr 23 '19 at 07:49
  • You are confusing Unicode with UTF-16. A Java `char` **is** a UTF-16 unit. Guess why it is called “UTF-16” and how it relates to the fact that a `char` has 16 bits. You may need two UTF-16 units to encode a single *codepoint*, but it’s not Java’s `char` to blame for that. – Holger Apr 23 '19 at 10:57

3 Answers


I looked around for a bit and couldn't find any clean ways to sort an array by groupings of two elements without the use of a library.

Luckily, the codePoints of the String are what you used to create the String itself in this example, so you can simply sort those and create a new String with the result.

public static void main(String[] args) {
    int[] utfCodes = {128531, 128557, 128513};
    String emojis = new String(utfCodes, 0, 3);
    System.out.println("Initial String: " + emojis);

    int[] codePoints = emojis.codePoints().sorted().toArray(); // sort by whole code points, not individual chars
    System.out.println("Sorted String: " + new String(codePoints, 0, 3));
}

Initial String: 😓😭😁
Sorted String: 😁😓😭

I switched the order of the characters in your example because they were already sorted.
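
If the goal is locale-aware ordering (the "collation" mentioned in the comments on the question) rather than raw code point order, a rough sketch using the JDK's java.text.Collator could look like the following; it reuses the emojis string from the snippet above and is only an illustration, not part of the accepted approach:

import java.text.Collator;
import java.util.Locale;
import java.util.stream.Collectors;

// Sketch: sort the code points of `emojis` by locale-sensitive collation order
// instead of raw numeric code point order.
String sortedByCollation = emojis.codePoints()
        .mapToObj(cp -> new String(Character.toChars(cp))) // one String per code point
        .sorted(Collator.getInstance(Locale.ROOT))         // Collator implements Comparator<Object>
        .collect(Collectors.joining());
System.out.println("Collation-sorted: " + sortedByCollation);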

Jacob G.
  • Haha.. my string was already sorted... I couldn't tell because I couldn't sort (pun intended). I should move to Java 8 =) – dingy Apr 23 '19 at 05:08
  • @dingy Java 8 is EOL. You _need_ to move to Java 12. – Boris the Spider Apr 23 '19 at 06:45
  • Codepoint support has existed since Java 5. It’s only the Stream API, which makes it look almost like a one-liner, that requires Java 8 or newer. – Holger Apr 23 '19 at 10:59

If you are using Java 8 or later, then this is a simple way to sort the characters in a string while respecting (not breaking) multi-char codepoints:

int[] codepoints = someString.codePoints().sorted().toArray();
String sorted = new String(codepoints, 0, codepoints.length);

Prior to Java 8, I think you either need to use a loop to iterate the code points in the original string, or use a 3rd-party library method.
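
A pre-Java-8 loop might look roughly like this (a sketch only; sortCodePoints is a made-up helper name, not a library method):

public static String sortCodePoints(String s) {
    // Walk the string code point by code point (stepping over surrogate pairs),
    // sort the resulting int values, then rebuild the string from them.
    int[] codePoints = new int[s.codePointCount(0, s.length())];
    for (int i = 0, cp = 0; i < s.length(); cp++) {
        codePoints[cp] = s.codePointAt(i);
        i += Character.charCount(codePoints[cp]); // advance by 1 or 2 chars
    }
    java.util.Arrays.sort(codePoints);            // sorting ints keeps each pair intact
    return new String(codePoints, 0, codePoints.length);
}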


Fortunately, sorting the code points in a String is uncommon enough that the clunkiness and relative inefficiency of the solutions above are rarely a concern.

(When was the last time you tested for anagrams of emojis?)

Stephen C
  • Thanks for reply. I was looking at Java 7's documentation; I should move to Java 8. BTW, I am from China and making an app where I need to sort strings in Mandarin, just kidding, but it's a valid use case. I stumbled upon it while I was trying to understand how Java works with UTF-16. Since the other answers are the same, I'll select the one which came earliest. Thanks again! – dingy Apr 23 '19 at 05:02
  • I didn't say invalid. I said uncommon. (And the fact that you had to make up a use-case only reinforces my point ... :-) ) – Stephen C Apr 23 '19 at 05:07
  • See also: https://chinese.stackexchange.com/questions/24053/chinese-anagrams. (First answer: "Why do you need that? We never use that in China.") – Stephen C Apr 23 '19 at 05:37
  • To add fuel to the flames, a single Emoji may consist of multiple codepoints. E.g. ‍♀️ consists of *five* codepoints (seven `char`s). But even latin characters may be composed of multiple codepoints. – Holger Apr 23 '19 at 11:33

We can't use char for Unicode, because Java's Unicode char handling is broken.

In the early days of Java, Unicode code points were always 16 bits (a fixed size of exactly one char). However, the Unicode specification later changed to allow supplementary characters. That means Unicode characters are now variable-width and can be longer than one char. Unfortunately, it was too late to change Java's char implementation without breaking a ton of production code.

So the best way to manipulate Unicode characters is by using code points directly, e.g., using String.codePointAt(index) or the String.codePoints() stream on JDK 1.8 and above.
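
For illustration (my own sketch, not part of the original answer; the class and variable names are made up), here is the difference between looking at chars and looking at code points for one supplementary character:

public class CodePointDemo {
    public static void main(String[] args) {
        String crying = new String(Character.toChars(0x1F62D)); // a supplementary character

        // charAt sees only the two UTF-16 units of the surrogate pair.
        System.out.printf("charAt(0)      = U+%04X%n", (int) crying.charAt(0)); // U+D83D
        System.out.printf("charAt(1)      = U+%04X%n", (int) crying.charAt(1)); // U+DE2D

        // codePointAt reassembles the pair into the actual code point.
        System.out.printf("codePointAt(0) = U+%04X%n", crying.codePointAt(0));  // U+1F62D

        // JDK 1.8+: stream over the code points directly.
        crying.codePoints().forEach(cp -> System.out.printf("codePoints()   -> U+%04X%n", cp));
    }
}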


peekay
  • Thanks for reply, I completely missed the String::codePointAt API. Also, I think I should move to Java 8. Since the other answers are the same, I'll select the one which came earliest. – dingy Apr 23 '19 at 05:04
  • @dingy If you're planning to make the JDK jump, consider skipping Java 8 and go straight to (Open) JDK 11 LTS, which has [some additional gems](https://blog.codefx.org/java/java-11-gems/). – peekay Apr 23 '19 at 11:25
  • Even before that change, there were *combining characters*, which invalidate the assumption that a single codepoint represents the entire character. – Holger Apr 23 '19 at 11:41
  • @Holger To be more precise, suppose we encode the letter `Á` using two characters: `A` (U+0041 Latin Capital Letter A) plus the combining character `◌́` (U+0301 Combining Acute Accent). In this case, notice that combining characters do not change the fact that each code point still only represent one character: we have two characters and two code points to represent the letter (_grapheme_) `Á`. – peekay Apr 23 '19 at 22:47
  • @MichaWiedenmann That's not correct. In Unicode 1.x a code point was always 16-bits and mapped to one Unicode character. See the [Unicode 1.0 Specification](https://www.unicode.org/versions/Unicode1.0.0/ch02.pdf). From the standard: `Unicode code points are 16-bit quantities.` (pg. 22) and `All Unicode characters have a uniform width of 16 bits.` (pg. 10). Code points larger than 16-bit (_supplementary characters_) were first assigned in Unicode 3.1. Java did not support them [until JDK 5.0](http://www.oracle.com/us/technologies/java/supplementary-142654.html) (September 2004). – peekay Apr 23 '19 at 23:05
  • Thank you for your clarification! I suggest you move part of your comment into the post, so we can clean up the comments here. – Micha Wiedenmann Apr 24 '19 at 07:22
  • @peekay `U+00c1` and `U+0041 U+0301` denote the same “*abstract character*” and changing the order of the code points such that the sequence does not represent that character anymore, is as wrong as changing the order of surrogate pairs, so regardless of the terminology, assuming that a program can shuffle `char`s around without caring for their meaning, was always wrong. – Holger Apr 24 '19 at 10:49