Java Collator with similar characteristics to MySQL's utf8_general_ci collation

Question

Is there a Collator implementation with the same characteristics as MySQL's utf8_general_ci? I need a collator that is case-insensitive and does not distinguish German umlauts such as ä from the base vowel a.

Background: We recently encountered a bug caused by the wrong collation on one of our tables. The column used utf8_general_ci where utf8_bin would have been correct, and it had a unique index. Because utf8_general_ci does not distinguish between words like pöker and poker, such rows were merged, which was not desired. I now need to implement a module for our Java application that repairs the affected rows.

Change the collation of the particular column (the unique index column) to `utf8_bin`. – Kunal Surana, Mar 22 '16 at 10:23
We already did that. The remaining problem is repairing the existing rows. The application needs to rebuild those faulty rows using the raw data. – Benjamin, Mar 22 '16 at 10:32
If you want case folding but accent sensitivity, please file a request at http://bugs.mysql.com. – Rick James, Mar 14 '17 at 22:50

Ilya Patrikeev · Accepted Answer · 2016-03-22T14:11:49.293

You could use the following collator:

Collator collator = Collator.getInstance();
collator.setStrength(Collator.PRIMARY);

A collator with this strength treats only primary differences (different base letters) as significant, so differences in accents and case are ignored during comparison.
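As a minimal sketch, the snippet below applies this setup to the words from the question. The class name `GeneralCiLikeCollator`, the helper `create()`, and the choice of `Locale.ROOT` are assumptions made for illustration; pinning a locale only avoids depending on the machine's default. A PRIMARY-strength collator approximates `utf8_general_ci` but is not guaranteed to agree with MySQL for every character, so it is worth checking against your own data.

```java
import java.text.Collator;
import java.util.Locale;

public class GeneralCiLikeCollator {

    // Builds a collator that ignores case and accents.
    // Locale.ROOT is an assumption; choose the locale that fits your data.
    public static Collator create() {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    public static void main(String[] args) {
        Collator collator = create();
        // "p\u00F6ker" is "pöker", written with an escape to avoid source-encoding issues
        System.out.println(collator.compare("p\u00F6ker", "poker") == 0);
        System.out.println(collator.compare("Poker", "poker") == 0);
        System.out.println(collator.compare("poker", "joker") == 0);
    }
}
```

The first two comparisons differ only by accent or case, the third by a base letter, which is the distinction the strength setting is meant to draw.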

Consider an example:

import java.text.Collator;

public class CollatorStrengthDemo {
    public static void main(String[] args) {
        System.out.println(compare("abc", "ÀBC", Collator.PRIMARY));   // base char
        System.out.println(compare("abc", "ÀBC", Collator.SECONDARY)); // base char + accent
        System.out.println(compare("abc", "ÀBC", Collator.TERTIARY));  // base char + accent + case
        System.out.println(compare("abc", "ÀBC", Collator.IDENTICAL)); // base char + accent + case + bits
    }

    // Compares two strings with the default-locale collator at the given strength.
    private static int compare(String first, String second, int strength) {
        Collator collator = Collator.getInstance();
        collator.setStrength(strength);
        return collator.compare(first, second);
    }
}

The output is:

0
-1
-1
-1
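For the repair task described in the question, one possible approach (an assumption on my part, not something MySQL or the JDK prescribes) is to group the raw values by primary-strength equality and inspect the groups that contain more than one member. The class `CollisionFinder`, the method `groupByPrimaryEquality`, the sample words, and the use of `Locale.ROOT` are all hypothetical names and choices for this sketch.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class CollisionFinder {

    // Groups raw values that a PRIMARY-strength collator treats as equal.
    public static Map<String, List<String>> groupByPrimaryEquality(List<String> rawValues) {
        Collator collator = Collator.getInstance(Locale.ROOT); // assumed locale
        collator.setStrength(Collator.PRIMARY);
        // The TreeMap uses the collator to decide key equality, so values that
        // differ only in case or accents end up under the same key.
        Map<String, List<String>> groups = new TreeMap<>(collator);
        for (String value : rawValues) {
            groups.computeIfAbsent(value, k -> new ArrayList<>()).add(value);
        }
        return groups;
    }

    public static void main(String[] args) {
        // "p\u00F6ker" is "pöker"
        List<String> raw = Arrays.asList("poker", "p\u00F6ker", "Poker", "joker");
        groupByPrimaryEquality(raw).forEach((key, members) -> {
            if (members.size() > 1) {
                System.out.println(key + " -> " + members);
            }
        });
    }
}
```

Groups with more than one member are candidates for rows that the unique index may have merged. Whether they match exactly what MySQL collapsed should be verified against the database.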

Have a look at these links for more information:

http://www.javapractices.com/topic/TopicAction.do?Id=207
https://docs.oracle.com/javase/7/docs/api/java/text/Collator.html#PRIMARY

edited Mar 22 '16 at 14:11

answered Mar 22 '16 at 13:55

Ilya Patrikeev

Note that by using `Collator.getInstance();` you are leaving it to circumstances what collator you actually get... I recommend choosing and explicitly specifying a `Locale`... The question then becomes... what locale? As it stands this code will pick a French or German locale if the computer it's running on is set to those settings... Might be fine, or might require your user to change their Windows settings just to get the correct result in your program... – Stijn de Witt Jun 14 '16 at 19:53

Also see this blog post: [Using MySQL Collations in Java](http://techblog.molindo.at/2009/10/using-mysql-collations-in-java.html) – Stijn de Witt Jun 14 '16 at 19:58

Also see this SO question: http://stackoverflow.com/questions/33999947/java-sorting-is-not-the-same-with-mysql-sorting – Stijn de Witt Jun 14 '16 at 20:08
