I noticed that java.text.Normalizer leaves some non-ASCII letters unchanged, such as the first letter in the name of the Polish city Łódź. Here is a small program that lists more of them:
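    import java.text.Normalizer

    fun main() {
        // Scan U+00C0..U+0170 (Latin-1 Supplement letters and most of Latin Extended-A).
        for (i in 0xC0..0x170) {
            val ch = Char(i)
            if (!ch.isLetter()) continue
            val norm = Normalizer.normalize(ch.toString(), Normalizer.Form.NFD)
            if (norm.length >= 2) {
                // Decomposed into a base letter plus a combining mark; not printed here.
                // println("'$ch' => '${norm[0]}' ${norm[0].code} '${norm[1]}' ${norm[1].code}")
            } else {
                // NFD left the character unchanged.
                println("'$ch' => '${norm[0]}' ${norm[0].code}")
            }
        }
    }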
This prints:
'Æ' => 'Æ' 198
'Ð' => 'Ð' 208
'Ø' => 'Ø' 216
...
'Ĳ' => 'Ĳ' 306
'ĳ' => 'ĳ' 307
'ĸ' => 'ĸ' 312
'Ŀ' => 'Ŀ' 319
'ŀ' => 'ŀ' 320
'Ł' => 'Ł' 321
'ł' => 'ł' 322
'ŉ' => 'ŉ' 329
'Ŋ' => 'Ŋ' 330
'ŋ' => 'ŋ' 331
'Œ' => 'Œ' 338
'œ' => 'œ' 339
'Ŧ' => 'Ŧ' 358
'ŧ' => 'ŧ' 359
To me, this somewhat defeats the purpose of the Normalizer: I assumed I could use it to get an ASCII equivalent for every character for which isLetter() returns true.
Does anyone know whether this is considered a bug? If not, is there another method that would map 'Ł' to 'L', 'Æ' to 'AE', etc?
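
To make the desired behavior concrete, here is a minimal sketch of the kind of mapping I am after. It assumes a hand-written fallback table for letters that NFD leaves unchanged; the function name toAsciiSketch and the table entries are hypothetical and far from exhaustive. It only illustrates the goal and is not meant as the definitive way to do this.

```kotlin
import java.text.Normalizer

// Hypothetical fallback table for letters that NFD does not decompose.
// The entries are illustrative only, not a complete transliteration scheme.
private val fallback: Map<Char, String> = mapOf(
    'Ł' to "L", 'ł' to "l",
    'Æ' to "AE", 'æ' to "ae",
    'Ø' to "O", 'ø' to "o",
    'Œ' to "OE", 'œ' to "oe"
)

// Hypothetical helper: decompose with NFD, drop combining marks,
// then replace any remaining letter found in the fallback table.
fun toAsciiSketch(input: String): String {
    val decomposed = Normalizer.normalize(input, Normalizer.Form.NFD)
    val sb = StringBuilder()
    for (ch in decomposed) {
        // Skip combining accents such as U+0301 left behind by NFD.
        if (ch.category == CharCategory.NON_SPACING_MARK) continue
        // Characters without a table entry pass through unchanged.
        sb.append(fallback[ch] ?: ch.toString())
    }
    return sb.toString()
}

fun main() {
    println(toAsciiSketch("Łódź"))  // prints "Lodz"
    println(toAsciiSketch("Æther")) // prints "AEther"
}
```

Characters that are neither decomposed by NFD nor listed in the table still come out as non-ASCII, which is exactly the gap I am asking about.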