.toUpperCase splits some chars in two? Might .toLowerCase do that too?

Question

This morning I was under the impression that .toUpperCase and .toLowerCase only translate the basic Latin chars a-z and A-Z and leave the more "exotic" characters alone, but of course, on closer inspection, that's really not the case...

console.log( "ﬁ".toLowerCase() ); // this yields a single char
>ﬁ

console.log( "ﬁ".toUpperCase() ); // this yields two chars
>FI

After reading the specs, it seems JavaScript applies the "Unicode Default Case Conversion algorithm", and it's a whole lot more complicated. The Unicode spec says the various mappings between upper, lower and title case are defined by the two files UnicodeData.txt and SpecialCasing.txt, and I don't doubt that, but trying to make sense of them enough to answer my question has brought me to the brink of a brain haemorrhage. Before I go any further I thought I would ask if anyone more familiar with Unicode knows...

edit: Thanks for your suggestions so far but THIS is my question...

Are there any Unicode upper-to-lower case conversions that might split a character into several chars?

And if so, is there a canonical JavaScript way to do a case conversion that doesn't split any characters? I want a case conversion method to make a single-char substring search case insensitive. Consequently, it doesn't matter if the result is a string of mixed case as long as it is consistent, i.e. a single character is always translated to a single character, be it upper or lower.

I don't think there is a guaranteed way to make a mirrored conversion between uppercase and lowercase. Not for every single case out there. For example, there is a German character `"ß"` - it's called eszett and it represents a double `s`. So `ss = ß` as far as normal reading/writing rules are involved: `Strasse = Straße` in normal usage. But `"ß".toUpperCase()` will convert this single character into `"SS"` because it represents *lowercase* `s`-es, there is no uppercase eszett. — VLAZ, Aug 01 '19 at 11:46
https://stackoverflow.com/questions/57256097/converting-case-in-place/57256360#comment101036451_57256360 — daxim, Aug 01 '19 at 12:05

score 0 · Answer 1 · edited Aug 01 '19 at 12:26

You're going to have a problem. Some conversions are required to produce multiple characters. ß is a fancy German way to write ss, but the uppercase letter ẞ erroneously does not fall under Unicode's casing rules, so for historical backward compatibility converting ß to uppercase yields SS. Similarly, İ (uppercase I with dot above) lowercases to i̇, which unfortunately looks like a normal lowercase i but is actually a lowercase i followed by a COMBINING DOT ABOVE. These are literally the first two entries in Unicode's SpecialCasing.txt.
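To see both directions of the problem in a console, here is a minimal sketch. The helper name `caseLengths` is made up for illustration; it only counts the code points that the built-in `toUpperCase` and `toLowerCase` return for a single input character.

```javascript
// Hypothetical helper (not a built-in): count the code points that each
// case conversion produces for one input character.
function caseLengths(ch) {
  return {
    // Spreading a string iterates by code point, not by UTF-16 code unit.
    upper: [...ch.toUpperCase()].length,
    lower: [...ch.toLowerCase()].length,
  };
}

for (const ch of ["ß", "ﬁ", "İ", "a"]) {
  const { upper, lower } = caseLengths(ch);
  console.log(JSON.stringify(ch), "upper:", upper, "lower:", lower);
}
```

ß and ﬁ report 2 for `upper`, while İ reports 2 for `lower`, so the answer to "might .toLowerCase do that too?" is yes.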

Point is, sometimes, there is no case folding solution that performs one-to-one character conversions. You need to write your algorithm to handle cases where searching for a single character actually searches for a pair of characters, or just accept that your algorithm isn't portable.

Something like this is the usual solution:

1. Complete case folding to convert both operands to the most caseless, composed form available
2. Substring search for the normalized needle in the normalized haystack
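The two steps above could be sketched as follows. JavaScript has no built-in case-folding function, so this sketch substitutes an approximation: lowercase, then uppercase, then NFC-normalize. That is an assumption, not Unicode's official CaseFolding.txt mapping, and the names `roughFold` and `includesCaseless` are hypothetical.

```javascript
// Approximate case folding (an assumption, not Unicode's CaseFolding.txt):
// lowercasing first sends ẞ to ß, and uppercasing then expands ß to "SS"
// and sends every sigma variant to Σ. NFC composes whatever can be composed.
function roughFold(s) {
  return s.toLowerCase().toUpperCase().normalize("NFC");
}

// Fold both operands, then run an ordinary substring search.
// A one-character needle may fold to several characters; that is expected.
function includesCaseless(haystack, needle) {
  return roughFold(haystack).includes(roughFold(needle));
}

console.log(includesCaseless("STRASSE", "ß")); // true
console.log(includesCaseless("Straße", "SS")); // true
console.log(includesCaseless("Strand", "ß"));  // false
```

Because the folded needle can be longer than the original, any match position refers to the folded haystack, not the original string. Mapping positions back would need extra bookkeeping that this sketch does not attempt.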
