Java/Kotlin Normalizer fails to normalize some accented letters

Question

I noticed that the Normalizer leaves some non-ASCII letters unchanged, such as the first letter in the name of the Polish city Łódź. Here are some more:

import java.text.Normalizer

fun main() {
    for (i in 0xC0..0x170) {
        val ch = Char(i)
        if (!ch.isLetter()) continue
        val norm = Normalizer.normalize(ch.toString(), Normalizer.Form.NFD)
        if (norm.length >= 2) {
            // println("'$ch' => '${norm[0]}' ${norm[0].code} '${norm[1]}' ${norm[1].code}")
        } else {
            println("'$ch' => '${norm[0]}' ${norm[0].code}")
        }
    }
}

This prints:

'Æ' => 'Æ' 198
'Ð' => 'Ð' 208
'Ø' => 'Ø' 216
...
'Ĳ' => 'Ĳ' 306
'ĳ' => 'ĳ' 307
'ĸ' => 'ĸ' 312
'Ŀ' => 'Ŀ' 319
'ŀ' => 'ŀ' 320
'Ł' => 'Ł' 321
'ł' => 'ł' 322
'ŉ' => 'ŉ' 329
'Ŋ' => 'Ŋ' 330
'ŋ' => 'ŋ' 331
'Œ' => 'Œ' 338
'œ' => 'œ' 339
'Ŧ' => 'Ŧ' 358
'ŧ' => 'ŧ' 359

To me, this somewhat defeats the purpose of the Normalizer: I assumed I could use it to get an ASCII equivalent for every character for which isLetter returns true.

Does anyone know whether this is considered a bug? If not, is there another method that would map 'Ł' to 'L', 'Æ' to 'AE', etc.?

Does this answer your question? [Converting Java String to ascii](https://stackoverflow.com/questions/3707977/converting-java-string-to-ascii) — aSemy, Jul 16 '22 at 20:23
Additionally, for comparing strings in a case-insensitive and accent-insensitive manner, use Collator https://stackoverflow.com/a/2373317/4161471 — aSemy, Jul 16 '22 at 20:36
The problem is not the Normalizer class. The Unicode character data available from unicode.org does not define a decomposition for those characters. Java is just conforming to the Unicode standard. — VGR, Jul 16 '22 at 23:15
Thanks! I was not able to import the `apache` module, and as for `Collator`, it doesn't solve the problem (I'll post more in an answer below). — Gideon Av, Jul 17 '22 at 22:56

score 0 · Answer 1 · answered Jul 17 '22 at 23:06

Here's some code using Collator. It doesn't know about the Polish Ł either, so I accept @VGR's explanation that the data for these characters is simply not there.

import java.text.Collator

fun main() {
    val s = "ŁóźÆæŒĸ"
    val sx = listOf("L","o","z","AE", "ae", "OE", "k")
    val c1 = Collator.getInstance()
    // PRIMARY strength compares base letters only, ignoring case and accents
    c1.strength = Collator.PRIMARY
    for ((i, ch) in s.withIndex()) {
        val cmp1 = c1.compare(ch.toString(), sx[i])
        println("'$ch' '${sx[i]}' -> $cmp1")
    }
}

Results:

'Ł' 'L' -> 1
'ó' 'o' -> 0
'ź' 'z' -> 0
'Æ' 'AE' -> 0
'æ' 'ae' -> 0
'Œ' 'OE' -> 0
'ĸ' 'k' -> 1
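
Since neither Normalizer nor Collator covers these letters, one possible workaround is to combine NFD normalization with a small hand-written fallback table. The following is only a minimal sketch: `asciiFallback` and `toAsciiApprox` are hypothetical names, and the ASCII replacements in the table are my own assumptions for illustration, not a mapping defined by Unicode or by the JDK. It decomposes the input, drops the combining marks that the decomposition produces, and looks up whatever is left in the table.

```kotlin
import java.text.Normalizer

// Hypothetical fallback table for letters that NFD leaves unchanged.
// The replacements are illustrative assumptions, not a standard mapping;
// extend the table with whatever letters your data needs.
val asciiFallback: Map<Char, String> = mapOf(
    'Ł' to "L", 'ł' to "l",
    'Æ' to "AE", 'æ' to "ae",
    'Œ' to "OE", 'œ' to "oe",
    'Ø' to "O", 'ø' to "o",
    'Ð' to "D", 'ð' to "d",
    'Ŧ' to "T", 'ŧ' to "t",
    'ĸ' to "k"
)

// Hypothetical helper: approximate a string with ASCII where possible.
fun toAsciiApprox(input: String): String {
    // Step 1: canonical decomposition, e.g. 'ó' becomes 'o' + U+0301.
    val decomposed = Normalizer.normalize(input, Normalizer.Form.NFD)
    val sb = StringBuilder(decomposed.length)
    for (ch in decomposed) {
        // Step 2: skip the combining marks produced by the decomposition.
        if (ch.category == CharCategory.NON_SPACING_MARK) continue
        // Step 3: use the table if it has an entry, else keep the character.
        sb.append(asciiFallback[ch] ?: ch.toString())
    }
    return sb.toString()
}

fun main() {
    println(toAsciiApprox("Łódź"))     // prints Lodz
    println(toAsciiApprox("ŁóźÆæŒĸ"))  // prints LozAEaeOEk
}
```

Characters that are neither decomposable nor in the table pass through unchanged, so the result is not guaranteed to be pure ASCII. Switching to `Normalizer.Form.NFKD` would additionally expand characters that have a compatibility decomposition, such as 'Ĳ' to "IJ".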
