Sort latin characters to the end in Japanese sorting

Question

I'd like to sort strings in Japanese (which may contain the various Japanese characters as well as Latin characters), and the Latin characters should be sorted to the end.

final Collator collator = Collator.getInstance(Locale.JAPANESE);
List<String> objcts = new ArrayList<>();

objcts.add("Alpha");
objcts.add("家事問屋");

Collections.sort(objcts, collator);
System.out.println(objcts);

Out: [Alpha, 家事問屋]

Desired Out: [家事問屋, Alpha]

Is there a simple, known way to achieve this?

I need to ask you to elaborate here, will each item be either fully Japanese or fully Latin? Will there be no mixing? — Aya Noaman, Aug 11 '21 at 07:42

hc_dev · Answer 1 · 2021-08-11T09:15:49.753

You could probably implement a Comparator (or extend Collator) that uses regexes to rank CJK strings before Latin ones, like this (despite its name, the class below puts CJK first):

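import java.text.Collator;
import java.util.Comparator;

public class LatinBeforeCJKCollator implements Comparator<String> {

    // Collator used whenever the script check below does not decide the order
    private final Collator collator;

    public LatinBeforeCJKCollator(Collator collator) {
        this.collator = collator;
    }

    @Override
    public int compare(String source, String target) {
        // A purely Japanese string sorts before a purely Latin one
        if (source.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]+") && target.matches("\\p{IsLatin}+")) {
            return -1;
        }
        // The reverse case, so that compare(x, y) and compare(y, x) stay consistent
        if (source.matches("\\p{IsLatin}+") && target.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]+")) {
            return 1;
        }
        // Same script or mixed content: defer to the wrapped collator
        return collator.compare(source, target);
    }

}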

I took the Unicode script classes from an answer to this question: How can I detect japanese text in a Java string?

You might need to customize the matching (e.g. all letters are Latin, first letter is Latin, etc.) to suit your needs.
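As an illustration of the "first letter is Latin" variant, here is a minimal sketch. It assumes that classifying a string by its first code point is good enough for your data, and it uses `Character.UnicodeScript` instead of regexes. The class name `FirstLetterLatinLast` and the helper `startsWithLatin` are made up for this example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class FirstLetterLatinLast {

    // Hypothetical helper: true if the first code point belongs to the Latin script.
    static boolean startsWithLatin(String s) {
        return !s.isEmpty()
                && Character.UnicodeScript.of(s.codePointAt(0)) == Character.UnicodeScript.LATIN;
    }

    // Boolean.FALSE sorts before Boolean.TRUE, so strings that do not start with a
    // Latin letter come first; within each group the given collator decides.
    static Comparator<String> latinLast(Collator collator) {
        Comparator<String> byGroup = Comparator.comparing(FirstLetterLatinLast::startsWithLatin);
        return byGroup.thenComparing(collator);
    }

    public static void main(String[] args) {
        List<String> strings = new ArrayList<>(List.of("Beta", "家事問屋", "Alpha"));
        strings.sort(latinLast(Collator.getInstance(Locale.JAPANESE)));
        System.out.println(strings); // [家事問屋, Alpha, Beta]
    }
}
```

Because every string is first assigned to exactly one of two groups, swapping the arguments always flips the sign of the result, so no separate reverse check is needed.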

When used like this:

final Comparator<String> comparator = new LatinBeforeCJKCollator(Collator.getInstance(Locale.JAPANESE));
List<String> strings = List.of("Alpha", "Beta", "問屋", "家事問屋");

System.out.println(strings.stream().sorted(comparator).collect(Collectors.joining(",")));

Then the output would appear sorted like this:

家事問屋,問屋,Alpha,Beta

There's a bug in your comparator: you need the reverse check as well, otherwise `sgn(compare(x, y)) != -sgn(compare(y, x))`. — Joachim Sauer, Aug 11 '21 at 09:08

score 0 · Answer 2 · answered Aug 11 '21 at 07:40

I assume the strings are in Unicode.

The ranges of the Latin letters are as follows. This Wikipedia article says:

As of version 13.0 of the Unicode Standard, 1,374 characters in the following blocks are classified as belonging to the Latin script:

Basic Latin, 0000–007F. This block corresponds to ASCII.

Latin-1 Supplement, 0080–00FF

Latin Extended-A, 0100–017F

Latin Extended-B, 0180–024F

IPA Extensions, 0250–02AF

Spacing Modifier Letters, 02B0–02FF

Phonetic Extensions, 1D00–1D7F

Phonetic Extensions Supplement, 1D80–1DBF

Latin Extended Additional, 1E00–1EFF

Superscripts and Subscripts, 2070–209F

Letterlike Symbols, 2100–214F

Number Forms, 2150–218F

Latin Extended-C, 2C60–2C7F

Latin Extended-D, A720–A7FF

Latin Extended-E, AB30–AB6F

Alphabetic Presentation Forms (Latin ligatures), FB00–FB4F

Halfwidth and Fullwidth Forms, FF00–FFEF

So most of them come before the Japanese ranges. Using these ranges, you could arrange for the Japanese characters to be placed first.

The ranges of the Japanese characters are:

Japanese-style punctuation (3000 - 303f)
Hiragana (3040 - 309f)
Katakana (30a0 - 30ff)
Full-width roman characters and half-width katakana (ff00 - ffef)
CJK unified ideographs - Common and uncommon kanji (4e00 - 9faf)

as listed in this post.
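A minimal sketch of how these ranges could be used, assuming that looking at the first code point of each string is enough to classify it. The class `JapaneseRanges` and its methods are hypothetical names for this example. Note that the ff00 - ffef block appears in both lists above, so full-width Latin letters count as Japanese here.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class JapaneseRanges {

    // Inclusive code point ranges copied from the list above.
    private static final int[][] RANGES = {
        {0x3000, 0x303F}, // Japanese-style punctuation
        {0x3040, 0x309F}, // Hiragana
        {0x30A0, 0x30FF}, // Katakana
        {0xFF00, 0xFFEF}, // Full-width roman characters and half-width katakana
        {0x4E00, 0x9FAF}, // CJK unified ideographs
    };

    static boolean isJapanese(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // Assumption: the first code point decides which group a string belongs to.
    static boolean startsWithJapanese(String s) {
        return !s.isEmpty() && isJapanese(s.codePointAt(0));
    }

    // Returns a sorted copy: Japanese strings first, then everything else,
    // each group ordered by the given collator.
    static List<String> japaneseFirst(List<String> input, Collator collator) {
        Comparator<String> byGroup = Comparator.comparing(s -> !startsWithJapanese(s));
        List<String> result = new ArrayList<>(input);
        result.sort(byGroup.thenComparing(collator));
        return result;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.JAPANESE);
        System.out.println(japaneseFirst(List.of("Alpha", "かな", "あい"), collator));
        // [あい, かな, Alpha]
    }
}
```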

feedy · Answer 3 · 2021-08-11T08:02:45.327

Does the order of the Japanese and English strings matter? If yes, you need to implement your own comparison method for the collator.

If the order does not matter, you can just do:

Collections.sort(objcts, Collections.reverseOrder());

To add a bit more to this: a collator is usually used for a single language, so you need a way to tell the characters of the two scripts apart. I would strongly suggest using two separate lists for English and Japanese text: detect which language each word is in and put it in the corresponding list. Then you can sort both lists accordingly and combine or use them as you wish.
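The two-list idea could be sketched roughly like this. The language check is an assumption: a word counts as English only if every code point is in the Latin script, and everything else is treated as Japanese. The names `TwoListSort`, `isLatinWord` and `sortJapaneseThenEnglish` are made up for this example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TwoListSort {

    // Hypothetical language check: true if every code point is in the Latin script.
    // Words containing spaces, digits or punctuation are therefore not "Latin" here.
    static boolean isLatinWord(String word) {
        return !word.isEmpty()
                && word.codePoints().allMatch(
                        cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN);
    }

    static List<String> sortJapaneseThenEnglish(List<String> words) {
        List<String> japanese = new ArrayList<>();
        List<String> english = new ArrayList<>();
        // Split the input into the two lists
        for (String word : words) {
            (isLatinWord(word) ? english : japanese).add(word);
        }
        // Sort each list with a collator for its own language
        japanese.sort(Collator.getInstance(Locale.JAPANESE));
        english.sort(Collator.getInstance(Locale.ENGLISH));
        // Combine: Japanese first, English at the end
        List<String> combined = new ArrayList<>(japanese);
        combined.addAll(english);
        return combined;
    }

    public static void main(String[] args) {
        System.out.println(sortJapaneseThenEnglish(List.of("Beta", "かな", "Alpha", "あい")));
        // [あい, かな, Alpha, Beta]
    }
}
```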

score 0 · Answer 4 · answered Aug 11 '21 at 08:27

I don't code much in Java, but I can explain the steps you can take.

As far as I know, there is no alphabet string provided in Java, so you can create a string variable that contains the alphabet (both upper- and lower-case). Let's call it alphabet. The string would look like this: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

Then you'll have to make a variable containing the last index of the list (its size minus one). We will call it last.

Assuming each item is either fully Japanese or fully Latin and assuming that your list is already full, you can loop through the list and perform these steps on each item:

Get the first character in the string.
Test to see if it is in alphabet.
If it is, move the item to index last (the end of the list). If it is not, leave it where it is.

That's basically it! I sincerely apologise for not being able to provide the code, as I code mostly in Python, but I hope this helped!
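In Java, the steps above might look roughly like the sketch below. Instead of moving items while looping, it collects the Latin items separately and appends them at the end, which has the same effect and keeps the original relative order. It does not sort within the groups, and the names `MoveLatinToEnd` and `moveLatinToEnd` are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class MoveLatinToEnd {

    static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Returns a new list in which every item whose first character is in ALPHABET
    // has been moved behind all other items.
    static List<String> moveLatinToEnd(List<String> items) {
        List<String> front = new ArrayList<>();
        List<String> back = new ArrayList<>();
        for (String item : items) {
            // Test whether the first character is in the alphabet string
            boolean latin = !item.isEmpty() && ALPHABET.indexOf(item.charAt(0)) >= 0;
            (latin ? back : front).add(item);
        }
        front.addAll(back);
        return front;
    }

    public static void main(String[] args) {
        System.out.println(moveLatinToEnd(List.of("Alpha", "家事問屋", "Beta", "問屋")));
        // [家事問屋, 問屋, Alpha, Beta]
    }
}
```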
