12

I have this code. It sorts correctly in French and Russian. I used Locale.US and it seems to work. Does this solution work correctly for all languages, for example Chinese, Korean, or Japanese? If not, what is a better solution?

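import java.text.Collator;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

public class CollationTest {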
    public static void main(final String[] args) {
        final Collator collator = Collator.getInstance(Locale.US);
        final SortedSet<String> set = new TreeSet<String>(collator);

        set.add("abîmer");
        set.add("abîmé");
        set.add("aberrer");
        set.add("abhorrer");
        set.add("aberrance");
        set.add("abécédaire");
        set.add("abducteur");
        set.add("abdomen");

        set.add("государственно-монополистический");
        set.add("гостить");
        set.add("гостевой");
        set.add("гостеприимный");
        set.add("госпожа");
        set.add("госплан");
        set.add("господи");
        set.add("господа");

        for(final String s : set) {
            System.out.println(s);
        }
    }
}

Update: Sorry, I don't need the set to hold words from all languages in order at the same time. I mean that the set contains words from one language at a time and should sort correctly whichever language that is.

public class CollationTest {
    public static void main(final String[] args) {
        final Collator collator = Collator.getInstance(Locale.US);
        final SortedSet<String> set = new TreeSet<String>(collator);

        // Sorting in French.
        set.clear();
        set.add("abîmer");
        set.add("abîmé");
        set.add("aberrer");
        set.add("abhorrer");
        set.add("aberrance");
        set.add("abécédaire");
        set.add("abducteur");
        set.add("abdomen");
        for(final String s : set) {
            System.out.println(s);
        }

        // Sorting in Russian.
        set.clear();
        set.add("государственно-монополистический");
        set.add("гостить");
        set.add("гостевой");
        set.add("гостеприимный");
        set.add("госпожа");
        set.add("госплан");
        set.add("господи");
        set.add("господа");
        for(final String s : set) {
            System.out.println(s);
        }
    }
}
emeraldhieu
  • 4
    I don't think you can meaningfully define an ordering of inter-language words. – Flexo Oct 03 '11 at 10:09
  • 3
    Even if the set only contains one language, you will still need to pick the correct `Locale` for the `Collator` every time you want to sort. – 一二三 Oct 03 '11 at 10:42
  • English sorts all variations of a letter under that letter, so Ä and Å are treated as A. But in Swedish, Ä and Å are unique letters found after Z. – Liggliluff Oct 17 '18 at 21:51
  • Reminder that not all the "Asian languages" are the same. Korean, for example, uses an alphabet (like English) and has a well-defined sorting order. – user3932000 Nov 02 '20 at 03:50

3 Answers

25

You cannot, because every language has its own alphabetical order. For example, the letter с in Russian, which you mentioned, is ordered differently than in Turkish.

You should always use a Collator for the language you are sorting. I suggest using it together with the Collections API:

    //
    // Define a collator for German language
    //
    Collator collator = Collator.getInstance(Locale.GERMAN);

    //
    // Sort the list using Collator
    //
    Collections.sort(words, collator);
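
The fragment above leaves `words` undefined, so here is a minimal, self-contained sketch of the same idea. The class name `LocaleSortExample`, the helper `sortFor`, and the sample words are illustrative assumptions, not part of the original answer.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical example class: sorts a list with the collator of a chosen locale.
class LocaleSortExample {

    // Returns a new list sorted by the collation rules of the given locale;
    // the input list is left unchanged.
    static List<String> sortFor(Locale locale, List<String> words) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator); // Collator is a Comparator, so it plugs in directly
        return sorted;
    }

    public static void main(String[] args) {
        // Sample German words, chosen only for illustration.
        List<String> words = Arrays.asList("Zucker", "Apfel", "Öl", "Bär");
        System.out.println(sortFor(Locale.GERMAN, words));
    }
}
```

The locale is passed in as a parameter, so the caller still has to know which language the words are in; the collator cannot work that out by itself.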

For further information check and as stated here

This program shows what can happen when you sort the same list of words with two different collators:

Collator fr_FRCollator = Collator.getInstance(new Locale("fr","FR"));

Collator en_USCollator = Collator.getInstance(new Locale("en","US"));

The method for sorting, called sortStrings, can be used with any Collator. Notice that the sortStrings method invokes the compare method:

public static void sortStrings(Collator collator, String[] words) {
    String tmp;
    for (int i = 0; i < words.length; i++) {
        for (int j = i + 1; j < words.length; j++) {
            if (collator.compare(words[i], words[j]) > 0) {
                tmp = words[i];
                words[i] = words[j];
                words[j] = tmp;
            }
        }
    }
}

The English Collator sorts the words as follows:

peach péché pêche sin

According to the collation rules of the French language, the preceding list is in the wrong order. In French péché should follow pêche in a sorted list. The French Collator sorts the array of words correctly, as follows:

peach pêche péché sin
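
To try the comparison yourself, a short sketch along the following lines should do. It uses `Arrays.sort` with the collator instead of the hand-written `sortStrings` loop; the class name `FrenchVsEnglishSort` and the helper `sortedCopy` are made up for this example. The relative order of the accented words depends on the collation data shipped with your Java runtime, so the sketch prints the result rather than assuming it.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical example class: sorts the same words with two different collators.
class FrenchVsEnglishSort {

    // Returns a sorted copy of the words using the collator for the given locale.
    static String[] sortedCopy(Locale locale, String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"peach", "péché", "pêche", "sin"};
        // Locale.US is en_US and Locale.FRANCE is fr_FR.
        System.out.println("en_US: " + String.join(" ", sortedCopy(Locale.US, words)));
        System.out.println("fr_FR: " + String.join(" ", sortedCopy(Locale.FRANCE, words)));
    }
}
```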

Cemo
  • If you were like me and read this great answer, but were not sure how to implement it, then check out this answer to a related question - https://stackoverflow.com/a/8433662/6110783 – Ben Feb 01 '22 at 20:32
10

Even if you could accurately detect the language being used, useful collation orders are usually specific to a particular language+country combination. And even within a language+country, collation can vary depending on usage or certain customisations.

However, if you do need to sort arbitrary sets of text, your best bet is the Unicode Collation Algorithm, which defines a language-independent collation for any Unicode text. The algorithm is customisable, but doesn't necessarily give results that make sense to any one culture (and definitely not across them).

Java's collation classes don't implement this algorithm, but it is available as part of ICU's RuleBasedCollator.

一二三
  • 1
    In Java you use new Locale("") to get the root locale (in Java 7 there is a Locale.ROOT constant). The Collator for this locale is the UCA. – Robert Muir Oct 15 '11 at 20:59
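
A minimal sketch of the root-locale suggestion from the comment above, using only the JDK. The class name `RootLocaleSort` and the helper `sortWithRoot` are hypothetical. Whether the JDK's root collator actually matches the Unicode Collation Algorithm is the commenter's claim, which the answer disputes, so treat that as an assumption; the sketch only shows how to obtain and use the collator.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical example class: sorts with the collator of the root locale.
class RootLocaleSort {

    // Returns a new list sorted with the language-neutral root locale collator.
    static List<String> sortWithRoot(List<String> words) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        // A mix of words taken from the question, in two scripts.
        List<String> words = Arrays.asList("господа", "abîmer", "госплан", "abdomen");
        System.out.println(sortWithRoot(words));
    }
}
```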
0

As far as I know, Chinese does not have an alphabetical order. Japanese possibly has an order for Hiragana and Katakana, but for Kanji it is doubtful. In computer science, however, everything is represented by numbers, and the same goes for the characters of a language: each character corresponds to a unique Unicode number. So this might be a solution for you: sort the words by their Unicode positions.
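
For reference, sorting by Unicode position is what Java's natural `String` ordering already does: `String.compareTo` compares UTF-16 code units, which matches code point order for characters in the Basic Multilingual Plane. The sketch below (the class name `CodePointSort` and the sample words are assumptions for illustration) shows what that ordering looks like in practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical example class: sorts strings by their raw Unicode values.
class CodePointSort {

    // Returns a new list in natural String order (UTF-16 code unit comparison).
    static List<String> sortByCodeUnits(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("zebra", "apple", "Zebra", "éclair");
        // prints [Zebra, apple, zebra, éclair]
        System.out.println(sortByCodeUnits(words));
    }
}
```

Uppercase letters come before all lowercase ones and accented letters land after "z", so this order is deterministic but rarely matches what a dictionary in any particular language would use.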