Unicode characters in a list of strings

Question

I need to break a string into a list of single characters. But my string can contain special characters such as "lã", and when I split it I get a list with three items: ["l", "a", "~"]. How can I get a list with only ["l", "ã"]? Here is my code. It looks like this because I've already made several attempts.

fun getListOfWords (string: String) : List<String>
{
    val list = arrayListOf<String>()
    for(i in 1 .. string.length)
        list.add(string.substring(i-1, i))
    return list
}

When I call it as getListOfWords("lã"), it gives me the correct output, but if I have a string x = "lã" and call getListOfWords(x), it gives me ["l", "a", "~"].

score 1 · Answer 1 · answered Feb 02 '19 at 08:20

This is about Unicode normalisation.

Unicode is quite flexible, and has multiple ways to encode some characters. In particular, ‘ã’ could be encoded as a single code point (U+00E3, LATIN SMALL LETTER A WITH TILDE), or as two (U+0061, LATIN SMALL LETTER A, followed by U+0303, COMBINING TILDE). The first is the precomposed form (the one NFC normalisation produces) and the second is the decomposed form (NFD), but both will look the same when printed out. Kotlin sees them differently, however, as you've discovered: a String is a sequence of UTF-16 chars, so the first spelling has length 1 and the second has length 2.
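To make the difference concrete, here is a minimal sketch that spells "lã" both ways and lists the code units of each. The names `composed`, `decomposed` and `codeUnits` are only illustrative, and `Char.code` assumes Kotlin 1.5 or later (older versions would use `toInt()`).

```kotlin
// Two spellings of "lã" that render identically but differ in content.
// \u escapes are used so the source file's own encoding cannot change the result.
val composed = "l\u00E3"      // 'l' + U+00E3 (precomposed a with tilde)
val decomposed = "la\u0303"   // 'l' + 'a' + U+0303 (combining tilde)

// Formats each UTF-16 code unit of the string as U+XXXX.
fun codeUnits(s: String): List<String> =
    s.map { "U+%04X".format(it.code) }
```

With these definitions, `composed.length` is 2 and `decomposed.length` is 3, and `composed == decomposed` is false even though both print as "lã". Calling `codeUnits(x)` on the string from the question should show which spelling it actually holds.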

Which one you start with will depend on where the string comes from. (For example, on the text editor used to save the source code it appears in, or on the text file you load it from.)

The good news is that, whichever form you start with, you can convert it to the form you want using a java.text.Normalizer:

import java.text.Normalizer
val normalizedString = Normalizer.normalize(string, Normalizer.Form.NFC)

You can then split the result (or do whatever other processing you want).
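Putting the two steps together might look like the sketch below. `splitIntoCharacters` is a hypothetical name, not something from the question or the standard library, and it assumes you want one String per UTF-16 char after composing.

```kotlin
import java.text.Normalizer

// Hypothetical helper: compose to NFC first, then emit one String per UTF-16 char.
fun splitIntoCharacters(text: String): List<String> =
    Normalizer.normalize(text, Normalizer.Form.NFC)
        .map { it.toString() }
```

Both `splitIntoCharacters("la\u0303")` and `splitIntoCharacters("l\u00E3")` then return a two-element list, "l" followed by "ã". Note that NFC can only compose sequences that have a precomposed code point, so a base letter plus a combining mark with no precomposed equivalent would still come out as two items, as would characters outside the Basic Multilingual Plane (which occupy two chars).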

Alternatively, if you prefer the decomposed form, you can use Normalizer.Form.NFD instead. (See Oracle's tutorial for more info. You can also use the Normalizer to do other processing such as remove diacritics.)
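As a rough illustration of the diacritic-removal idea: decompose with NFD, then drop the combining marks. The names `combiningMarks` and `stripDiacritics` are made up for this sketch, and it assumes that removing every nonspacing mark (Unicode category Mn) is what you want, which may be too aggressive for some scripts.

```kotlin
import java.text.Normalizer

// Matches runs of Unicode nonspacing marks (category Mn), e.g. U+0303 COMBINING TILDE.
val combiningMarks = Regex("\\p{Mn}+")

// Hypothetical helper: decompose to NFD so accents become separate marks, then remove them.
fun stripDiacritics(text: String): String =
    Normalizer.normalize(text, Normalizer.Form.NFD)
        .replace(combiningMarks, "")
```

Here `stripDiacritics("l\u00E3")` and `stripDiacritics("la\u0303")` both give "la", because NFD turns either spelling into 'a' followed by the combining tilde before the regex removes the mark.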

By the way, this means there's nothing wrong with your getListOfWords() function. Well, apart from the name, as it's not actually splitting words, but I guess it's a work in progress! If you really want to split on characters, the built-in ‘String.toList()’ function does much the same, though it returns a List<Char> rather than a List<String>; ‘string.map { it.toString() }’ gives you strings.
