Where can I find a specific set of collation rules for equality comparison of strings?

Question

We all know that using String's equals() method for equality comparison will fail miserably. Instead, one should use Collator, like this:

// we need to detect User Interface locale somehow
Locale uiLocale = Locale.forLanguageTag("da-DK");
// Setting up collator object
Collator collator = Collator.getInstance(uiLocale);
collator.setStrength(Collator.SECONDARY);
collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
// strings for equality testing
String test1 = "USA lover Grækenland støtte";
String test2 = "USA lover graekenland støtte";
boolean result = collator.equals(test1, test2);

Now, this code works, that is, result is true, unless uiLocale is set to Danish. In that case it yields false. I understand why this happens: the equals method is implemented like this:

return compare(s1, s2) == Collator.Equal;

This method calls the one used for sorting and checks whether the strings compare as equal. They do not, because Danish-specific collation rules require æ to be sorted after ae (if I understand the result of the compare method correctly). However, these strings really are the same: at this strength, both case differences and such compatibility characters (that's what they are called) should be treated as equal.
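A minimal sketch of the behavior described above, assuming a standard JDK. The class name CollatorEqualsDemo and the helper equalAt are made up for illustration. Locale.ENGLISH is used for the case-only checks because strength handling of case is uncontroversial there. The Danish result is printed rather than asserted, since it depends on the collation data shipped with the JDK.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorEqualsDemo {
    // Hypothetical helper: compares two strings with a locale collator at the given strength.
    static boolean equalAt(Locale locale, int strength, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        // Collator.equals(String, String) is documented as a convenience over compare()
        boolean viaEquals = collator.equals(a, b);
        boolean viaCompare = collator.compare(a, b) == 0;
        if (viaEquals != viaCompare) {
            throw new IllegalStateException("equals() and compare() disagree");
        }
        return viaEquals;
    }

    public static void main(String[] args) {
        // Case is a tertiary difference, so SECONDARY strength ignores it
        System.out.println(equalAt(Locale.ENGLISH, Collator.SECONDARY, "Grækenland", "grækenland")); // true
        System.out.println(equalAt(Locale.ENGLISH, Collator.TERTIARY, "Grækenland", "grækenland"));  // false

        // The case from the question: the outcome depends on the locale's sorting rules
        Locale danish = Locale.forLanguageTag("da-DK");
        System.out.println(equalAt(danish, Collator.SECONDARY, "Grækenland", "graekenland"));
    }
}
```

The point of the sketch is that there is only one rule set per locale: whatever compare uses for ordering is also what equals uses for equality.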

To fix this, one would use RuleBasedCollator with a specific set of rules that work for the equality case.
Finally, the question: does anyone know where I can get such rules (not only for Danish, but for other languages as well), so that compatibility characters, ligatures, etc. are treated as equal? (The CLDR chart does not seem to contain them, or I failed to find them.)

Or maybe I want to do something stupid here, and I should simply use UCA for equality comparison (any code sample, please)?

String's equals() does exactly what it's supposed to do, and comparing words with equivalent spelling in certain languages is not part of that, so I find saying it fails miserably misleading. — Stefan, Dec 05 '11 at 20:17
@Stefan: The problem is it is not. For example for strings containing accented characters or umlauts (à or ä) it will return **false** if one of the strings would use canonical decomposition. The spelling might be the same, doesn't matter. Even worse results will give you equalsIgnoreCase() - case variants like sharp s or final sigma won't be recognized. That's just because these methods use binary comparison which is not suitable for international strings. — Paweł Dyda, Dec 05 '11 at 23:08
The keyword is canonical decomposition. This is a (natural) language feature and has nothing to do with String representation; actually, in most cases you want them to be treated differently as a String. I agree with you on equalsIgnoreCase: that one is bad because it blurs the line between a String that is just a container for characters and words in a language/locale. — Stefan, Dec 06 '11 at 16:13
"We all know that using String's equals() method for equality comparison will fail miserably.". By what reference can you make such an assertion? The common definition of "equality" is the condition of being equal. Of course "USA lover Grækenland støtte" is not equal to "USA lover graekenland støtte", java or not? What are you asking? — Bob Kuhar, Jan 01 '12 at 04:32
@U Mad: I have tried full decomposition but it does not work for Danish locale. This is caused by their sorting collation rules which should be different than equality collation rules. The problem is, there is no such distinction in the JDK. As far as I understand [what I see](http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html#getKeywordValues(java.lang.String)) somebody meant to introduce this distinction in [ICU](http://site.icu-project.org/), thus the question. I wouldn't want to use full decomposition even if that works - there is a reason why it was removed from ICU. — Paweł Dyda, Jan 01 '12 at 18:10
@Bob Kuhar: I wrote that on purpose, so that people reading my question stop to think about equality comparison. What is **equal** anyway? For an English-language user, this is obvious: binary comparison will suffice. Unfortunately, the example I gave shows that this does not necessarily hold true for other languages. Like it or not, to a Danish user these strings ***are equal***. This applies to any strings written with compatibility decomposition (like in the example) or canonical decomposition. There are many use cases where this is important, and using String.equals() could lead to subtle bugs. — Paweł Dyda, Jan 01 '12 at 18:15
@Zack: Did you mean override equals() method on String? Well, String is final and Java does not have extension methods so it won't be quite possible in "classic" Java. Well, one could use AOP but I don't think it makes sense. — Paweł Dyda, Jan 11 '12 at 16:47
@Zack: In case you missed it, I didn't ask *how to fix it* but whether somebody knows where I can find specific collation rules. I know how to, and I can create them on my own but it would be wasteful if I did and it is already done... — Paweł Dyda, Jan 11 '12 at 16:49
@PawełDyda - my bad - you stated your question clearly; I just didn't read it clearly! Will remove previous comment shortly... — Zack Macomber, Jan 11 '12 at 17:28
@PawełDyda - just want to clarify here...your equals() method is on a Collator object, though, isn't it? In the example above, I only see "collator.equals(test1, test2);" which isn't String.equals() right? — Zack Macomber, Jan 11 '12 at 17:36
@Zack: Yes, it is. The reason it is on Collator is that String's equals() method might yield incorrect results for non-ASCII strings. And while we are at it, I guess I mentioned that RuleBasedCollator takes a rule string as a constructor parameter. Therefore subclassing Collator doesn't make sense either. The only thing we are actually missing is valid Locale-based data (which is the hardest part). — Paweł Dyda, Jan 11 '12 at 19:20
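
The canonical-decomposition point raised in the comments above can be sketched as follows. This is only an illustration, assuming a standard JDK: the class name DecompositionDemo and both helpers are hypothetical, and java.text.Normalizer is shown as one possible way to make binary comparison tolerate decomposed input, not as a replacement for locale-aware rules.

```java
import java.text.Collator;
import java.text.Normalizer;
import java.util.Locale;

public class DecompositionDemo {
    // Hypothetical helper: binary comparison after normalizing both sides to NFC.
    static boolean nfcEquals(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    // Hypothetical helper: collator comparison with canonical decomposition enabled.
    static boolean collatorEquals(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.TERTIARY);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.equals(a, b);
    }

    public static void main(String[] args) {
        String composed = "\u00E4";    // ä as a single code point
        String decomposed = "a\u0308"; // a followed by combining diaeresis

        System.out.println(composed.equals(decomposed));         // false: different char sequences
        System.out.println(nfcEquals(composed, decomposed));      // true
        System.out.println(collatorEquals(composed, decomposed)); // true
    }
}
```

Normalization only handles canonically equivalent sequences. It does nothing for pairs such as æ and ae, which is why the question is about rule data rather than about normalization.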

beerbajay · Answer 1 · 2012-01-25T09:48:15.907

I can't find any existing alternative Collator for Danish; the built-in one for the Danish locale is supposed to be correct. I am not sure that your assumption that ae should be sorted with æ holds, specifically because of certain foreign words (for example "aerofobi") in Danish (I am not a Danish speaker, though I do speak Swedish).

But if you want to sort them together, it seems you have two ways to do this, depending on the context you're in. In certain contexts, just replacing the characters might be appropriate:

String str = "USA lover graekenland støtte";
String sortStr = str.replace("ae", "æ");

The other, perhaps better, option is the one you mentioned: using RuleBasedCollator. Adapting the example from the Javadoc, this is fairly simple:

String danish = "< a, A < b, B < c, C < d, D < e, E < f, F < g, G < h, H < i, I" +
                "< j, J < k, K < l, L < m, M < n, N < o, O < p, P < q, Q < r, R" +
                "< s, S < t, T < u, U < v, V < w, W < x, X < y, Y < z, Z" +
                "< \u00E6 = ae," +       // Latin letter ae
                "  \u00C6 = AE " +       // Latin letter AE
                "< \u00F8, \u00D8" +     // Latin letter o & O with stroke
                "< \u00E5 = a\u030A," +  // Latin letter a with ring above
                "  \u00C5 = A\u030A;" +  // Latin letter A with ring above
                "  aa, AA";
RuleBasedCollator danishCollator = new RuleBasedCollator(danish);

Which you can then use:

String test1 = "USA lover Grækenland støtte";
String test2 = "USA lover Graekenland støtte";         // note capital 'G'
boolean result = danishCollator.equals(test1, test2);  // true

If you believe that the default collator is incorrect, you may wish to report a bug. (There have previously been similar bugs).

Update: I checked this with a printed Danish-language encyclopedia. There are indeed words which begin with 'ae' (primarily words from foreign languages; "aerobics", for example) which are not sorted with (and therefore not equal to) words beginning with 'æ'. So although I see why you would want to treat them as equal in many circumstances, they are not strictly so.
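
The caveat in the update can be seen with the blanket replacement approach from earlier in this answer. The sketch below is only illustrative: the class name AeFoldingDemo and the helper foldAe are made up, and lowercasing with Locale.ROOT is an added assumption to make the comparison case-insensitive.

```java
import java.util.Locale;

public class AeFoldingDemo {
    // Hypothetical helper: lowercases the input and rewrites every "ae" as "æ".
    static String foldAe(String s) {
        return s.toLowerCase(Locale.ROOT).replace("ae", "æ");
    }

    // Hypothetical helper: equality on the folded forms.
    static boolean foldedEquals(String a, String b) {
        return foldAe(a).equals(foldAe(b));
    }

    public static void main(String[] args) {
        // The intended match from the question
        System.out.println(foldedEquals("USA lover Grækenland støtte",
                                        "USA lover graekenland støtte")); // true

        // The unintended match: a loanword spelled with 'ae' is folded as well
        System.out.println(foldAe("aerobics"));                  // ærobics
        System.out.println(foldedEquals("aerobics", "ærobics")); // true
    }
}
```

Whether the second match is acceptable depends on the use case. For search it may be fine; for strict dictionary equality it is a false positive.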

I am not asking about sorting. Danish rules for sorting are correct. To be honest it is not even about Danish rules, just the rules for equality comparison. There are simply no such publicly available rules yet. — Paweł Dyda, Jan 24 '12 at 18:03
Right, and if you use the Collator with the provided set, your 'ae' and 'æ' are equal. — beerbajay, Jan 25 '12 at 08:20

score 0 · Answer 2 · answered Jul 16 '15 at 16:10

One way to get the rules for a specific locale is to use the getRules function. However, on Android, this function returns an empty string.

    RuleBasedCollator collTemp = (RuleBasedCollator) Collator
            .getInstance(Locale.US);
    String usRules = collTemp.getRules();


    // Save the rules in a file (needs java.io.*); try-with-resources
    // closes the writer even if write() throws
    String rulesPath = "C:\\projects\\droid\\rules.txt";
    try (BufferedWriter out = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(rulesPath), "UTF-16"))) {
        out.write(usRules);
    }

These rules are the same ones used by the compare function.

if (collTemp.compare(target, str) < 0)

Note: I tried to pass the rule string from my desktop JDK app to the Android RuleBasedCollator constructor, but I get U_INVALID_FORMAT_ERROR (on Android only). So I am still trying to figure out how to get the US rules on Android.
