This is about Unicode normalisation.
Unicode is quite flexible, and has multiple ways to encode some characters. In particular, ‘ã’ could be encoded as a single code point (U+00E3, LATIN SMALL LETTER A WITH TILDE), or as two (U+0061, LATIN SMALL LETTER A, followed by U+0303, COMBINING TILDE). The first is the precomposed form, which is what NFC normalisation produces; the second is the decomposed form, which is what NFD produces. Both will look the same when printed out. Kotlin sees them differently, however, as you've discovered: the two strings have different lengths and don't compare equal.
Which one you start with will depend on where the string comes from (for example, the text editor you used to save the source code it appears in, or the text file you load it from).
The good news is that, whichever form you start with, you can convert it to the form you want using java.text.Normalizer:
val normalizedString = Normalizer.normalize(string, Normalizer.Form.NFC)
You can then split the result (or do whatever other processing you want).
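To make the difference concrete, here is a minimal sketch that builds both encodings of ‘ã’ with explicit escapes (so the result doesn't depend on how an editor saved the file), compares them, and then normalises to NFC before splitting. The variable names are only for illustration.

```kotlin
import java.text.Normalizer

fun main() {
    val composed = "\u00E3"      // ã as a single code point
    val decomposed = "a\u0303"   // a followed by COMBINING TILDE

    // Same appearance, different contents.
    println(composed == decomposed)  // false
    println(composed.length)         // 1
    println(decomposed.length)       // 2

    // After normalising to NFC, the two strings are identical.
    val nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC)
    println(nfc == composed)         // true

    // Splitting into characters now yields a single element.
    println(nfc.toList().size)       // 1
}
```

Normalising both sides of a comparison to the same form is usually the safest approach when you don't control where the strings come from.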
Alternatively, if you prefer the decomposed form, you can use Normalizer.Form.NFD instead. (See Oracle's tutorial for more info. You can also use the Normalizer as a step in other processing, such as removing diacritics.)
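As a rough sketch of the diacritic-removal idea: decompose to NFD so each accent becomes a separate combining mark, then delete those marks with a regex. The function name stripDiacritics is made up for this example, and the approach only handles accents that decompose into combining marks (it won't turn ‘ø’ into ‘o’, for instance).

```kotlin
import java.text.Normalizer

// Hypothetical helper, not part of any library.
fun stripDiacritics(text: String): String {
    // NFD splits each accented letter into base letter + combining mark(s).
    val decomposed = Normalizer.normalize(text, Normalizer.Form.NFD)
    // \p{Mn} matches non-spacing marks such as U+0303 COMBINING TILDE.
    return decomposed.replace(Regex("\\p{Mn}+"), "")
}

fun main() {
    println(stripDiacritics("S\u00E3o Paulo"))  // Sao Paulo
}
```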
By the way, this means there's nothing wrong with your getListOfWords() function. Well, apart from the name, as it's not actually splitting words, but I guess it's a work in progress! If you really want to split on characters, the built-in ‘String.toList()’ function does exactly the same.