String.equalsIgnoreCase - UpperCase v. LowerCase

Question

I was browsing through the OpenJDK source and noticed an odd code path in String.equalsIgnoreCase, specifically in the method regionMatches:

if (ignoreCase) {
    // If characters don't match but case may be ignored,
    // try converting both characters to uppercase.
    // If the results match, then the comparison scan should
    // continue.
    char u1 = Character.toUpperCase(c1);
    char u2 = Character.toUpperCase(c2);
    if (u1 == u2) {
        continue;
    }
    // Unfortunately, conversion to uppercase does not work properly
    // for the Georgian alphabet, which has strange rules about case
    // conversion.  So we need to make one last check before
    // exiting.
    if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
        continue;
    }
}

I understand the comment about adding a lowercase equality check to handle a specific alphabet, but I was wondering why the uppercase check is there at all. Why not just compare everything in lowercase?

Are you asking that instead of using `toUpperCase()` why didn't they use `toLowerCase()`? — Kayaman, Aug 26 '14 at 18:48
This question seems to be similar to this: http://stackoverflow.com/questions/15518731/understanding-logic-in-caseinsensitivecomparator — Glen Keane, Aug 26 '14 at 18:49
My guess is that `toLowerCase` would have a similar problem: if it failed you would then need to try `toUpperCase`. It's just a coin toss which one is done first. — Ted Hopp, Aug 26 '14 at 18:50
@DanW Well if the case is ignored, they'll have to be both either uppercase or lower case to test for similarity, won't they? — Kayaman, Aug 26 '14 at 18:51
@Kayaman I'm just asking why have the upper case check if it appears by the comment that lower case handles more cases. However, as TedHopp points out, there's probably a similar issue with toLowerCase. — Dan W, Aug 26 '14 at 18:53
@DanW The comment talks about "strange rules about case conversion". It says nothing about lower case handling more cases. — Kayaman, Aug 26 '14 at 18:56
@EJP Why was this marked as a duplicate? The suggested duplicate post just points to the same line of code I questioned about. It does not say why the toUpperCase check is needed. And if there is a situation where toLowerCase would return false for a check where toUpperCase works, that is not cited. — Dan W, Aug 26 '14 at 19:30
The other post does not ask the same question and none of the answers say why toLowerCase isn't used solely. There's no mention of another alphabet having issues with toUpperCase -- that's only being assumed by some commenters here. Not a duplicate in answer or question. — AHungerArtist, Aug 26 '14 at 19:42
@Kevin Worth noting that answer was added an hour ago and didn't exist when this question was asked. Glad to see it answered, though. — AHungerArtist, Aug 26 '14 at 20:51
But *this* question is now a duplicate since its answer is one of the answers of http://stackoverflow.com/questions/15518731/understanding-logic-in-caseinsensitivecomparator — Serge Ballesta, Aug 26 '14 at 22:14
@DanW Christian Semrau [already provided a counterexample](http://stackoverflow.com/a/25513639/451518). Actually, you can use a simple brute-force check to search for code points that satisfy the given conditions. Take a look at the code here: http://ideone.com/DgPx23 — default locale, Aug 27 '14 at 04:47
@defaultlocale Yes, I see that now. However, that answer was posted after I asked this question, it got closed, and then re-opened. — Dan W, Aug 27 '14 at 13:51
@defaultlocale I intended to give my answer to this question. :-) It got closed while I was preparing my answer, so I answered the other question. — Christian Semrau, Aug 27 '14 at 14:28

score 15 · Accepted Answer · answered Aug 27 '14 at 14:37

Now that the question has been re-opened, I am transferring my answer here.

The short answer to "Why do they not just compare only lowercase instead of both upper and lower case, if it matches more cases than uppercase?": It does not match more character pairs, it merely matches different pairs.

Comparing only uppercase is not enough: e.g. the ASCII letter "I" and the capital I with dot "İ" ((char)304, used in the Turkish alphabet) have different uppercase forms (they are already uppercase), but they share the lowercase letter "i". (Note that the Turkish language treats i with dot and i without dot as different letters, not as accented variants of one letter, similar to German with its umlauts ä/ö/ü vs. a/o/u.)

Comparing only lowercase is not enough either: e.g. the ASCII letter "i" and the small dotless i "ı" ((char)305) have different lowercase forms (they are already lowercase), but they share the uppercase letter "I".

And finally, compare capital I with dot "İ" with small dotless i "ı". Neither their uppercases ("İ" vs. "I") nor their lowercases ("i" vs. "ı") match, but the lowercase of their uppercase is the same ("i"). I found another case of this phenomenon in the Greek letters "ϴ" and "ϑ" (char 1012 and 977).
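The three situations above can be checked directly with `Character.toUpperCase` and `Character.toLowerCase`. The following is a minimal sketch; the class `CaseCheckDemo` and its helper methods are names made up for this illustration and are not part of the JDK.

```java
public class CaseCheckDemo {
    // Hypothetical helper: do the two chars have the same uppercase form?
    static boolean sameUpper(char a, char b) {
        return Character.toUpperCase(a) == Character.toUpperCase(b);
    }

    // Hypothetical helper: do the two chars have the same lowercase form?
    static boolean sameLower(char a, char b) {
        return Character.toLowerCase(a) == Character.toLowerCase(b);
    }

    // Hypothetical helper: the "last check" from regionMatches,
    // i.e. compare the lowercase of the uppercase of each char.
    static boolean sameLowerOfUpper(char a, char b) {
        return Character.toLowerCase(Character.toUpperCase(a))
            == Character.toLowerCase(Character.toUpperCase(b));
    }

    static void report(char a, char b) {
        System.out.printf("U+%04X vs U+%04X: upper=%b lower=%b lowerOfUpper=%b%n",
            (int) a, (int) b,
            sameUpper(a, b), sameLower(a, b), sameLowerOfUpper(a, b));
    }

    public static void main(String[] args) {
        report('I', '\u0130');      // "I" vs. capital I with dot (char 304)
        report('i', '\u0131');      // "i" vs. small dotless i (char 305)
        report('\u0130', '\u0131'); // capital I with dot vs. small dotless i
        report('\u03F4', '\u03D1'); // Greek theta symbols (char 1012 and 977)
    }
}
```

The `char` overloads used here do not depend on the default locale, which is presumably why the same pairs behave identically on a Turkish and a non-Turkish system.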

So a true case-insensitive comparison cannot get by with checking only the uppercases and lowercases of the original characters; it must also check the lowercases of the uppercases.
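Pairs like these can also be found by brute force, as suggested in the comments. The sketch below is one possible way to do it, not the code linked there: it groups every `char` by the lowercase of its uppercase and then reports pairs within a group whose plain uppercase and plain lowercase forms both differ. The class `ThirdCheckSearch` and its method names are hypothetical, and the result depends on the Unicode version shipped with the JDK in use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ThirdCheckSearch {
    // Lowercase of the uppercase, the last comparison made in regionMatches.
    static char fold(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // True if the pair matches only through fold(), i.e. neither the
    // uppercase forms nor the lowercase forms of the originals are equal.
    static boolean needsThirdCheck(char a, char b) {
        return Character.toUpperCase(a) != Character.toUpperCase(b)
            && Character.toLowerCase(a) != Character.toLowerCase(b)
            && fold(a) == fold(b);
    }

    static List<char[]> findPairs() {
        // Group all char values by their folded form, so that only chars
        // that can possibly match are compared with each other.
        Map<Character, List<Character>> groups = new HashMap<>();
        for (int cp = 0; cp <= Character.MAX_VALUE; cp++) {
            char c = (char) cp;
            groups.computeIfAbsent(fold(c), k -> new ArrayList<>()).add(c);
        }
        List<char[]> pairs = new ArrayList<>();
        for (List<Character> group : groups.values()) {
            for (int i = 0; i < group.size(); i++) {
                for (int j = i + 1; j < group.size(); j++) {
                    if (needsThirdCheck(group.get(i), group.get(j))) {
                        pairs.add(new char[] {group.get(i), group.get(j)});
                    }
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (char[] p : findPairs()) {
            System.out.printf("U+%04X and U+%04X%n", (int) p[0], (int) p[1]);
        }
    }
}
```

Grouping first keeps the search linear in the number of `char` values, instead of comparing every possible pair.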

This is a modified copy of my answer at http://stackoverflow.com/a/25513639/282229 — Christian Semrau, Aug 27 '14 at 14:37
