16

I was browsing through the OpenJDK source and noticed an odd code path in `String.equalsIgnoreCase`, specifically in the method `regionMatches`:

```java
if (ignoreCase) {
    // If characters don't match but case may be ignored,
    // try converting both characters to uppercase.
    // If the results match, then the comparison scan should
    // continue.
    char u1 = Character.toUpperCase(c1);
    char u2 = Character.toUpperCase(c2);
    if (u1 == u2) {
        continue;
    }
    // Unfortunately, conversion to uppercase does not work properly
    // for the Georgian alphabet, which has strange rules about case
    // conversion.  So we need to make one last check before
    // exiting.
    if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
        continue;
    }
}
```

I understand the comment about adjusting for a specific alphabet to check the lower case equality, but was wondering why even have the upper case check? Why not just do all lower case?

Dan W
  • Are you asking that instead of using `toUpperCase()` why didn't they use `toLowerCase()`? – Kayaman Aug 26 '14 at 18:48
  • 3
    This question seems to be similar to this: http://stackoverflow.com/questions/15518731/understanding-logic-in-caseinsensitivecomparator – Glen Keane Aug 26 '14 at 18:49
  • 2
    My guess is that `toLowerCase` would have a similar problem: if it failed you would then need to try `toUpperCase`. It's just a coin toss which one is done first. – Ted Hopp Aug 26 '14 at 18:50
  • @Kayaman Yes. Why even have the toUpperCase checks? – Dan W Aug 26 '14 at 18:50
  • @DanW Well if the case is ignored, they'll have to be both either uppercase or lower case to test for similarity, won't they? – Kayaman Aug 26 '14 at 18:51
  • @Kayaman I'm just asking why have the upper case check if it appears by the comment that lower case handles more cases. However, as TedHopp points out, there's probably a similar issue with toLowerCase. – Dan W Aug 26 '14 at 18:53
  • 1
    @DanW The comment talks about "strange rules about case conversion". It says nothing about lower case handling more cases. – Kayaman Aug 26 '14 at 18:56
  • 2
    @EJP Why was this marked as a duplicate? The suggested duplicate post just points to the same line of code I questioned about. It does not say why the toUpperCase check is needed. And if there is a situation where toLowerCase would return false for a check where toUpperCase works, that is not cited. – Dan W Aug 26 '14 at 19:30
  • 2
    The other post does not ask the same question and none of the answers say why toLowerCase isn't used solely. There's no mention of another alphabet having issues with toUpperCase -- that's only being assumed by some commenters here. Not a duplicate in answer or question. – AHungerArtist Aug 26 '14 at 19:42
  • 1
    http://stackoverflow.com/a/25513639/900873 – Kevin Aug 26 '14 at 20:48
  • @Kevin If you want to post that I'll accept as the answer. – Dan W Aug 26 '14 at 20:50
  • 1
    @Kevin Worth noting that answer was added an hour ago and didn't exist when this question was asked. Glad to see it answered, though. – AHungerArtist Aug 26 '14 at 20:51
  • 1
    But *this* question is now a duplicate since its answer is one of the answers of http://stackoverflow.com/questions/15518731/understanding-logic-in-caseinsensitivecomparator – Serge Ballesta Aug 26 '14 at 22:14
  • @DanW Christian Semrau [already provided a counterexample](http://stackoverflow.com/a/25513639/451518). Actually, you can use a simple brute-force check to search for code points that satisfy the given conditions. Take a look at the code here: http://ideone.com/DgPx23 – default locale Aug 27 '14 at 04:47
  • @defaultlocale Yes, I see that now. However, that answer was posted after I asked this question, it got closed, and then re-opened. – Dan W Aug 27 '14 at 13:51
  • 2
    @defaultlocale I intended to give my answer to this question. :-) It got closed while I was preparing my answer, so I answered the other question. – Christian Semrau Aug 27 '14 at 14:28

1 Answer

15

Now that the question is re-opened, I transfer my answer here.

The short answer to "Why do they not just compare only lowercase instead of both upper and lower case, if it matches more cases than uppercase?": It does not match more character pairs, it merely matches different pairs.

Comparing only uppercase is not enough. For example, the ASCII letter "I" and the capital I with dot "İ" ((char)304, used in the Turkish alphabet) have different uppercase forms (both are already uppercase), but they share the lowercase letter "i". (Note that Turkish treats i with dot and i without dot as different letters, not as one letter with and without an accent, similar to German with its umlauts ä/ö/ü vs. a/o/u.)

Comparing only lowercase is not enough either. For example, the ASCII letter "i" and the small dotless i "ı" ((char)305) have different lowercase forms (both are already lowercase), but they share the uppercase letter "I".
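The two pairs above can be checked directly with a minimal sketch. The class name `SingleCaseChecks` and the helpers `equalUpperOnly` and `equalLowerOnly` are made up for illustration and are not part of the JDK; only `Character.toUpperCase` and `Character.toLowerCase` are real API calls.

```java
// Hypothetical demo class; the helper names are illustrative, not JDK API.
class SingleCaseChecks {

    // Compares two chars by their uppercase forms only.
    static boolean equalUpperOnly(char a, char b) {
        return Character.toUpperCase(a) == Character.toUpperCase(b);
    }

    // Compares two chars by their lowercase forms only.
    static boolean equalLowerOnly(char a, char b) {
        return Character.toLowerCase(a) == Character.toLowerCase(b);
    }

    public static void main(String[] args) {
        char capitalI = 'I';
        char capitalDottedI = (char) 304; // capital I with dot
        char smallI = 'i';
        char smallDotlessI = (char) 305;  // small dotless i

        // Uppercase-only misses this pair; lowercase-only catches it.
        System.out.println(equalUpperOnly(capitalI, capitalDottedI)); // false
        System.out.println(equalLowerOnly(capitalI, capitalDottedI)); // true

        // Lowercase-only misses this pair; uppercase-only catches it.
        System.out.println(equalLowerOnly(smallI, smallDotlessI));    // false
        System.out.println(equalUpperOnly(smallI, smallDotlessI));    // true
    }
}
```

Each single-direction check fails on one pair that the other direction accepts, which is why neither can simply replace the other.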

And finally, compare capital I with dot "İ" with small dotless i "ı". Neither their uppercase forms ("İ" vs. "I") nor their lowercase forms ("i" vs. "ı") match, but the lowercase of their uppercase is the same ("i"). I found another case of this phenomenon in the Greek letters "ϴ" and "ϑ" (char 1012 and 977).

So a true case-insensitive comparison cannot rely even on checking both the uppercase and the lowercase forms of the original characters; it must also check the lowercase of the uppercase.
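As a rough sketch of that conclusion, the per-character logic quoted in the question can be pulled into a standalone method and run against the pairs mentioned here. The class `FoldCompare` and the method `equalsIgnoringCase` are hypothetical names for this illustration; it mirrors the quoted snippet rather than reproducing the actual JDK source.

```java
// Hypothetical demo class mirroring the per-char logic quoted in the question.
class FoldCompare {

    static boolean equalsIgnoringCase(char c1, char c2) {
        if (c1 == c2) {
            return true;
        }
        // Step 1: compare the uppercase forms.
        char u1 = Character.toUpperCase(c1);
        char u2 = Character.toUpperCase(c2);
        if (u1 == u2) {
            return true;
        }
        // Step 2: compare the lowercase forms of the uppercase forms.
        return Character.toLowerCase(u1) == Character.toLowerCase(u2);
    }

    public static void main(String[] args) {
        char capitalDottedI = (char) 304; // capital I with dot
        char smallDotlessI = (char) 305;  // small dotless i
        char capitalTheta = (char) 1012;  // Greek capital theta symbol
        char smallTheta = (char) 977;     // Greek theta symbol

        // Neither plain uppercase nor plain lowercase matches these pairs.
        System.out.println(Character.toUpperCase(capitalDottedI)
                == Character.toUpperCase(smallDotlessI)); // false
        System.out.println(Character.toLowerCase(capitalDottedI)
                == Character.toLowerCase(smallDotlessI)); // false

        // The lowercase-of-uppercase step is what makes them equal.
        System.out.println(equalsIgnoringCase(capitalDottedI, smallDotlessI)); // true
        System.out.println(equalsIgnoringCase(capitalTheta, smallTheta));      // true
        System.out.println(equalsIgnoringCase('a', 'b'));                      // false
    }
}
```

Both pairs are accepted only by the final lowercase-of-uppercase comparison, while unrelated letters such as 'a' and 'b' are still rejected.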

Christian Semrau