Will JavaScript's String prototype method toUpperCase() deliver the naturally expected result in every UTF-8-supported language/charset?

I've tried Simplified Chinese, Korean, Tamil, Japanese and Cyrillic, and the results seemed reasonable so far. Can I rely on the method being language-safe?

Example:

  "イロハニホヘトチリヌルヲワカヨタレソツネナラムウヰノオクヤマケフコエテアサキユメミシヱヒモセス".toUpperCase()
> "イロハニホヘトチリヌルヲワカヨタレソツネナラムウヰノオクヤマケフコエテアサキユメミシヱヒモセス"

Edit: As @Quentin pointed out, there is also a String.prototype.toLocaleUpperCase(), which is probably even "safer" to use, but I also have to support IE 8 and above, as well as WebKit-based browsers. Since toLocaleUpperCase() is part of the ECMAScript 3 standard, it should be available on all those browsers, right?

Does anyone know of any cases where using it delivers naturally unexpected results?

connexo
  • "No" is a safe bet here. There are a *lot* of languages with UTF-8 characters and many of them do not even have the concept of upper or lower case characters. – tadman Jun 10 '15 at 17:08
  • See also https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/toLocaleUpperCase – Quentin Jun 10 '15 at 17:12
  • Small aside: Please politely inform your Windows XP users that without security updates, they are (98% likely) part of a global botnet that makes network engineers' jobs much harder. – Katana314 Jun 10 '15 at 17:36
  • @Katana314 an aside that is non-related. Why are you going OT? – connexo Jun 10 '15 at 17:43
  • @connexo Well, because you mentioned that you're supporting IE8 and above. Windows 7, with security updates, will be on IE11, so the most common reason to support IE8 is Windows XP. I usually won't point out minor things like "Your image should have an `alt`!" but for reasonably large issues, people usually at least provide a short comment on them to make sure they're aware; or perhaps include it as a note in their answer. – Katana314 Jun 10 '15 at 17:56
  • @Katana314 Anyone still on IE 8 knows, and is likely there for a reason. No need to badger your users. – Brad Jun 10 '15 at 20:34

2 Answers

What do you expect?

JavaScript's toUpperCase() method is supposed to use the locale-invariant uppercase mapping as defined by the Unicode standard. So, basically, "i".toUpperCase() is supposed to be I in all cases. Where the locale-invariant uppercase mapping consists of multiple letters, browsers have not always uppercased them correctly; for example, "ß".toUpperCase() should be SS, but in some browsers it is not.
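A quick way to see which behaviour a given engine has is to check the multi-letter case directly. This is only a sketch; the variable name is made up for illustration.

```javascript
// "ß" (U+00DF) has no single-letter uppercase; Unicode's special casing maps it to "SS".
var upperSharpS = "\u00DF".toUpperCase();

console.log(upperSharpS === "SS"
  ? "multi-letter mapping applied"
  : "character left as-is: " + upperSharpS);
```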

Also, there are locales that have different uppercase rules than the rest of the world, the most notable example being Turkish, where the uppercase version of i is İ (and vice versa) and the lowercase version of I is ı (and vice versa).

If you want that behaviour, you will need a browser that is set to a Turkish locale, and you have to use the toLocaleUpperCase() method.
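As a minimal sketch of the difference between the two methods: the explicit "tr" argument below is an assumption on my part, since it comes from the later ECMA-402 internationalization API rather than ECMAScript 3. Engines without that API (most likely including IE 8) should simply ignore it and use the host locale. The variable names are only illustrative.

```javascript
// Locale-invariant mapping: independent of the browser's locale.
var invariant = "i".toUpperCase();

// Locale-sensitive mapping. The "tr" locale tag is an assumption: it needs
// ECMA-402 support plus Turkish locale data, otherwise the host locale applies.
var turkish = "i".toLocaleUpperCase("tr");

console.log(invariant);             // "I"
console.log(turkish === "\u0130");  // true only where Turkish casing is applied
```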

Also note that some writing systems have a third case, "title case", which is applied to the first letter of a word when you want to "capitalize" it. This is also defined by the Unicode standard (for example, the title case of the ligature ǌ (U+01CC) is ǋ (U+01CB), while the upper case is Ǌ (U+01CA)), but (as far as I know) it is not available to JavaScript. Therefore, if you try to capitalize a word using substring and toUpperCase, expect it to be wrong in rare cases.
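To illustrate that last point, here is a rough sketch for the nj ligature. naiveCapitalize is a hypothetical helper and the sample word is only illustrative. It produces the uppercase ligature (U+01CA) where title case would call for U+01CB.

```javascript
// Hypothetical helper: capitalizes by uppercasing the first UTF-16 code unit.
function naiveCapitalize(word) {
  return word.charAt(0).toUpperCase() + word.slice(1);
}

var word = "\u01CCiva";              // starts with the small ligature U+01CC
var result = naiveCapitalize(word);

// Logs "1ca" (the uppercase ligature), not "1cb" (the title-case one).
console.log(result.charCodeAt(0).toString(16));
```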

mihi

Yes. From the spec:

[Returns] a String where each character is either the Unicode uppercase equivalent of the corresponding character of [the input] or the actual corresponding character of [the input] if no Unicode uppercase equivalent exists.

For the purposes of this operation, the 16-bit code units of the Strings are treated as code points in the Unicode Basic Multilingual Plane. Surrogate code points are directly transferred from [input to output] without any mapping.

The result must be derived according to the case mappings in the Unicode character database (this explicitly includes not only the UnicodeData.txt file, but also the SpecialCasings.txt file that accompanies it in Unicode 2.1.8 and later).

So while this might not exactly match your language's expectations (as many languages use the same characters but not necessarily in the same way), it certainly delivers the naturally expected result as specified in the Unicode Character Database.
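The code-unit rule in the quoted text can be probed with a character outside the Basic Multilingual Plane. This is a sketch rather than a statement about any particular browser: U+10428 is written as its surrogate pair, and whether it gets mapped depends on whether the engine follows the wording quoted above or a later edition of the spec that maps whole code points.

```javascript
// U+10428 (a Deseret small letter) as a surrogate pair; its uppercase is U+10400.
var small = "\uD801\uDC28";
var capital = "\uD801\uDC00";

var mapped = small.toUpperCase();

// Unchanged under the per-code-unit rule quoted above; mapped to U+10400 if the
// engine applies case mappings to whole code points.
console.log(mapped === capital ? "mapped to U+10400" : "left unchanged");
```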

Bergi
  • So, for surrogate pairs it is even defined to be wrong (and yes, surrogate pairs also have uppercase mappings, e.g. http://decodeunicode.org/en/u+10428/properties has upper case mapping http://decodeunicode.org/en/u+10400) – mihi Jun 10 '15 at 17:39