This morning I was under the impression that .toUpperCase and .toLowerCase only translate the basic Latin chars a-z and A-Z and leave the more "exotic" characters alone, but on closer inspection that's really not the case...

console.log( "ﬁ".toLowerCase() ); // U+FB01, the fi ligature; this yields a single char
>ﬁ

console.log( "ﬁ".toUpperCase() ); // this yields two chars
>FI

After reading the specs, it seems JavaScript applies the "Unicode Default Case Conversion algorithm", and it's a whole lot more complicated. The Unicode spec says the various mappings between upper, lower and title case are defined by the two files UnicodeData.txt and SpecialCasing.txt, and I don't doubt that, but trying to make sense of them well enough to answer my question has brought me to the brink of a brain haemorrhage. Before I go any further I thought I would ask if anyone more familiar with Unicode knows...


edit: Thanks for your suggestions so far but THIS is my question...

Are there any Unicode upper-to-lower-case conversions that might split a character into several chars?


And if so, is there a canonical JavaScript way to do a casing conversion that doesn't split any characters? I want a case conversion method to make a single-char substring search case insensitive. Consequently it doesn't matter if the result is a string of mixed case as long as it is consistent, i.e. a single character is always translated to a single character, be it upper or lower.

Roger Heathcote
  • I don't think there is a guaranteed way to make a mirrored conversion between uppercase and lowercase. Not for every single case out there. For example, there is a German character `"ß"`. It's called eszett and it represents a double `s`. So `ss = ß` as far as normal reading/writing rules are concerned: `Strasse = Straße` in normal usage. But `"ß".toUpperCase()` will convert this single character into `"SS"` because it represents *lowercase* `s`-es, there is no uppercase eszett. – VLAZ Aug 01 '19 at 11:46
  • https://stackoverflow.com/questions/57256097/converting-case-in-place/57256360#comment101036451_57256360 – daxim Aug 01 '19 at 12:05

1 Answer

You're going to have a problem. Some conversions are required to produce multiple characters. ß is a German way to write ss; a capital form (ẞ, U+1E9E) exists, but Unicode's default casing rules do not map ß to it, so for backward compatibility converting ß to uppercase gives SS. Similarly, İ (uppercase I with dot above, U+0130) lowercases to i̇, which unfortunately looks like a normal lowercase i but is actually a lowercase i followed by U+0307 COMBINING DOT ABOVE. These are literally the first two examples in Unicode's SpecialCasing.txt.
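
Both mappings can be checked in a JavaScript console. This is only a quick sketch; the helper name `codePoints` is made up for illustration and is not a built-in:

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX.
function codePoints(s) {
  return Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const upperEszett = "\u00DF".toUpperCase();  // ß
const lowerDottedI = "\u0130".toLowerCase(); // İ

console.log(codePoints(upperEszett));  // [ 'U+0053', 'U+0053' ]
console.log(codePoints(lowerDottedI)); // [ 'U+0069', 'U+0307' ]
```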

Point is, sometimes, there is no case folding solution that performs one-to-one character conversions. You need to write your algorithm to handle cases where searching for a single character actually searches for a pair of characters, or just accept that your algorithm isn't portable.
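
If you do decide to accept the limitation, one possible compromise is to convert one code point at a time and leave a character untouched whenever its conversion would expand. The sketch below assumes that trade-off is acceptable; `foldOneToOne` is a hypothetical name, not a standard function:

```javascript
// Hypothetical helper: lowercase one code point at a time, keeping the
// original character whenever the conversion would not be one-to-one.
function foldOneToOne(s) {
  let out = "";
  for (const ch of s) { // for...of iterates by code point, not UTF-16 unit
    const lower = ch.toLowerCase();
    // Count code points so that an expansion such as İ -> i + U+0307 is caught.
    out += Array.from(lower).length === 1 ? lower : ch;
  }
  return out;
}

console.log(foldOneToOne("STRASSE"));       // "strasse"
console.log(foldOneToOne("\u0130stanbul")); // İ is left as-is
```

The result always has as many code points as the input, but it is not a real case fold: ß will not match SS, and İ will not match i.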

Something like this is the usual solution:

  1. Apply full case folding to convert both operands to the most caseless, composed form available
  2. Substring search for the normalized needle in the normalized haystack
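
The two steps above might be sketched like this. JavaScript has no built-in case folding function, so this sketch assumes that NFC normalization followed by `toUpperCase().toLowerCase()` is a close enough approximation of full case folding; `foldForSearch` and `includesCaseless` are hypothetical names:

```javascript
// Assumption: approximate full case folding with NFC normalization
// followed by toUpperCase().toLowerCase(). This is not exact folding.
function foldForSearch(s) {
  return s.normalize("NFC").toUpperCase().toLowerCase();
}

// Hypothetical helper: fold both operands, then do a plain substring search.
function includesCaseless(haystack, needle) {
  return foldForSearch(haystack).includes(foldForSearch(needle));
}

console.log(includesCaseless("Straße", "SS"));     // true
console.log(includesCaseless("\uFB01nal", "FI")); // true (U+FB01 is the fi ligature)
console.log(includesCaseless("Straße", "x"));      // false
```

Folding can change the length of a string, so an index found in the folded haystack does not necessarily point at the same place in the original.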
daxim
ShadowRanger