13

Why do the following three characters not have symmetric toLower/toUpper results?

/**
  * Written in the Scala programming language, typed into the Scala REPL.
  * Results commented accordingly.
  */
/* Unicode Character 'LATIN CAPITAL LETTER SHARP S' (U+1E9E) */
'\u1e9e'.toHexString == "1e9e" // "1e9e" == "1e9e"
'\u1e9e'.toLower.toHexString == "df" // "df" == "df"
'\u1e9e'.toHexString == '\u1e9e'.toLower.toUpper.toHexString // "1e9e" != "df"
/* Unicode Character 'KELVIN SIGN' (U+212A) */
'\u212a'.toHexString == "212a" // "212a" == "212a"
'\u212a'.toLower.toHexString == "6b" // "6b" == "6b"
'\u212a'.toHexString == '\u212a'.toLower.toUpper.toHexString // "212a" != "4b"
/* Unicode Character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130) */
'\u0130'.toHexString == "130" // "130" == "130"
'\u0130'.toLower.toHexString == "69" // "69" == "69"
'\u0130'.toHexString == '\u0130'.toLower.toUpper.toHexString // "130" != "49"
Solomon Ucko
Tim Friske
  • 3
    Perhaps because Unicode is ambiguous? Some glyphs have multiple representations in Unicode, and `toLower` after `toUpper` or vice versa normalizes to the "lowest" code points. –  Sep 20 '11 at 21:03
  • Jeff Moser's excellent [Turkey Test post](http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html) covered the Turkish I issue in particular. – MPG Sep 26 '11 at 15:01

2 Answers

13

For the first one, there is this explanation:

In the German language, the Sharp S ("ß" or U+00df) is a lowercase letter, and it capitalizes to the letters "SS".

In other words, U+1E9E lower-cases to U+00DF, but the upper-case of U+00DF is not U+1E9E.
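A minimal sketch of that asymmetry, assuming a JVM whose Unicode tables already include U+1E9E. The object name `SharpSDemo` and the `hex` helper are illustrative only. A `Char`-level mapping must return exactly one character, so it cannot produce "SS"; the `String`-level method can.

```scala
import java.util.Locale

object SharpSDemo {
  val capitalSharpS: Char = '\u1e9e' // LATIN CAPITAL LETTER SHARP S
  val smallSharpS: Char   = '\u00df' // LATIN SMALL LETTER SHARP S

  // Char-level (simple) mappings: one Char in, one Char out.
  val loweredChar: Char = capitalSharpS.toLower // U+00DF
  val upperedChar: Char = smallSharpS.toUpper   // unchanged, still U+00DF

  // String-level (full) mapping is allowed to change the length.
  val upperedString: String = smallSharpS.toString.toUpperCase(Locale.ROOT) // "SS"

  // Hypothetical helper that prints a code unit the way the question does.
  def hex(c: Char): String = Integer.toHexString(c.toInt)
}
```

Neither path leads back to U+1E9E, which is why the round trip in the question ends at "df".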

For the second one, U+212A (KELVIN SIGN) lower-cases to U+006B (LATIN SMALL LETTER K). The upper-case of U+006B is U+004B (LATIN CAPITAL LETTER K). This one seems to make sense to me.

For the third case, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is a Turkish/Azerbaijani character that lower-cases to U+0069 (LATIN SMALL LETTER I). I would imagine that if you were somehow in a Turkish/Azerbaijani locale you'd get the proper upper-case version of U+0069, but that might not necessarily be universal.
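The locale guess above can be checked with the locale-aware `String` methods. This is only a sketch, assuming the runtime ships Turkish casing rules (standard JVMs do); the object name `TurkishCaseDemo` is made up for the example.

```scala
import java.util.Locale

object TurkishCaseDemo {
  // Assumption: "tr" resolves to a locale with Turkish casing rules.
  val turkish: Locale = Locale.forLanguageTag("tr")

  val dottedCapitalI = "\u0130" // LATIN CAPITAL LETTER I WITH DOT ABOVE

  // Under Turkish rules, i and U+0130 form a pair, so the round trip is symmetric.
  val lowered: String   = dottedCapitalI.toLowerCase(turkish) // "i"
  val roundTrip: String = lowered.toUpperCase(turkish)        // "\u0130"

  // Under locale-neutral rules, "i" upper-cases to plain "I" (U+0049).
  val neutralUpper: String = "i".toUpperCase(Locale.ROOT)
}
```

The `Char` methods used in the question take no locale, so they can only apply the locale-neutral mapping.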

Characters need not necessarily have symmetric upper- and lower-case transformations.

Edit: To respond to PhiLho's comment below, the Unicode 6.0 spec has this to say about U+212A (KELVIN SIGN):

Three letterlike symbols have been given canonical equivalence to regular letters: U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN. In all three instances, the regular letter should be used. If text is normalized according to Unicode Standard Annex #15, “Unicode Normalization Forms,” these three characters will be replaced by their regular equivalents.

In other words, you shouldn't really be using U+212A, you should be using U+004B (LATIN CAPITAL LETTER K) instead, and if you normalize your Unicode text, U+212A should be replaced with U+004B.
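On the JVM the replacement described above can be sketched with `java.text.Normalizer`. The object name `KelvinNormalizeDemo` and the `nfc` helper are illustrative only.

```scala
import java.text.Normalizer

object KelvinNormalizeDemo {
  val kelvinSign = "\u212a" // KELVIN SIGN

  // Normalization Form C, as defined in Unicode Standard Annex #15.
  def nfc(s: String): String = Normalizer.normalize(s, Normalizer.Form.NFC)

  // The canonical equivalent is the ordinary capital letter.
  val normalized: String = nfc(kelvinSign) // "K" (U+004B)
}
```

After normalizing, the case round trip is symmetric again, because the text contains a regular K rather than the letterlike symbol.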

Community
CanSpice
  • 2
    I find it wrong to give a lowercase equivalent of the Kelvin sign; the case of unit symbols should never be changed. I.e., even in an all-caps title, one should really write: "HE RAN 42 km IN 4 h"... – PhiLho Sep 21 '11 at 08:41
  • 3
    People are always confused about Unicode case, because they think everything works like the 26 ASCII letters, and it doesn’t. Think of the situation with the three Greek sigmas, for example. Also, there are lowercase code points that do not change case when mapped, etc. There are actually four kinds of Unicode case, in a sense, where "fold case" is the fourth. To compare two strings case insensitively, you must map each to their case folds and compare the results of that mapping. – tchrist Sep 26 '11 at 15:25
  • 1
    Actually, it is not so much about Unicode as it is about cultural conventions. Germans uppercase sharp s as SS, Unicode only honors that practice. – Mihai Nita Oct 07 '11 at 09:13
  • @tchrist: Mapping to "fold case", how would you do it? Would `uc(lc(c))` do? – maaartinus Sep 30 '13 at 08:52
  • @maaartinus No amount of `uc` or `lc` combinations will reliably get you to the fold case mappings that Unicode provides. That’s why Perl provides an `fc` function. If you’re stuck in Java, you might look into the ICU libraries, which may have something. – tchrist Sep 30 '13 at 10:10
  • @maaartinus Running `unichars 'uc(lc) ne fc && lc(uc) ne fc'` shows ***two round-trip exceptions*** in the Basic Multilingual Plane: `U+0131 ı GC=Ll SC=Latin LATIN SMALL LETTER DOTLESS I` is one and `U+1E9E ẞ GC=Lu SC=Latin LATIN CAPITAL LETTER SHARP S` is the other. The second is “new” and not likely to appear in many texts; it’s an uc character whose lc is `U+00DF ß LATIN SMALL LETTER SHARP S`, but the uc of that is in turn `SS`. The *“dotless i”* is a **big problem** in the Turkic languages, which need special casing rules because it causes so much trouble: `uc("ı")` is `I` but `lc("I")` is `i`. – tchrist Sep 30 '13 at 11:00
  • @maaartinus All weird casing situations are in the BMP. There are 116 code points whose fc mapping differs from their lc mapping: 91 are Greek, 18 are Latin, 6 are Armenian, and 1 is `U+00B5 µ GC=Ll SC=Common MICRO SIGN`. The 18 Latin exceptions (derived from running `unichars '\p{Latin}' 'lc ne fc'`) are: `U+00DF ß`, `U+0149 ŉ`, `U+017F ſ`, `U+01F0 ǰ`, `U+1E96 ẖ`, `U+1E97 ẗ`, `U+1E98 ẘ`, `U+1E99 ẙ`, `U+1E9A ẚ`, `U+1E9B ẛ`, `U+1E9E ẞ`, `U+FB00 ﬀ`, `U+FB01 ﬁ`, `U+FB02 ﬂ`, `U+FB03 ﬃ`, `U+FB04 ﬄ`, `U+FB05 ﬅ`, and `U+FB06 ﬆ`. The last 7 of those are compatibility characters for old text only. – tchrist Sep 30 '13 at 11:13
  • @tchrist: That's pretty crazy. It inspired me to [a question](http://stackoverflow.com/questions/19135354/assuming-unicode-and-case-insensitivity-should-the-pattern-match-ffiss). – maaartinus Oct 02 '13 at 10:58
  • @maaartinus And me to an answer. Note that Java uses simple case folding in pattern matching, but full case mapping for its string functions. – tchrist Oct 02 '13 at 13:48
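The contrast drawn in the last comment can be seen in a small sketch. It assumes only the standard `java.util.regex` and `String` APIs; the object name `FoldingGapDemo` is made up for the example.

```scala
import java.util.Locale
import java.util.regex.Pattern

object FoldingGapDemo {
  private val flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE

  // Pattern matching compares character by character, so a single
  // sharp s can never match the two-letter pattern "ss".
  val regexMatches: Boolean =
    Pattern.compile("ss", flags).matcher("\u00df").matches()

  // equalsIgnoreCase is per-character as well.
  val equalsIgnoreCase: Boolean = "\u00df".equalsIgnoreCase("ss")

  // String-level upper-casing applies the full mapping instead.
  val fullUpper: String = "\u00df".toUpperCase(Locale.ROOT) // "SS"
}
```

Upper-casing both sides is not a substitute for real case folding; it merely shows that the two mechanisms disagree for this character.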
3

May I refer to another post about Unicode and upper and lower case? It is a common mistake to think that the characters of a language have to be available in both upper and lower case!

Unicode-correct title case in Java

Community
  • Particularly true for ideograms... :-) – PhiLho Sep 21 '11 at 08:38
  • 1
    You actually cannot do Unicode-correct titlecase in Java. There is only a `Character` method, not a `String` method the way there is for uppercase and lowercase. This is a real problem. – tchrist Sep 26 '11 at 15:20