Character length is 1 in Python, but 2 in Kotlin

Question

I'm iterating over a String in Kotlin and noticed that Kotlin views a certain Chinese character as having length 2. The same character has length 1 in Python 3.8.

Kotlin:

>>> "".length
2

Python:

>>> len("")
1

Why is that the case and how can I iterate over the string in Kotlin character by character?

The "why" sounds like it's because it's a double-byte character. I've never used Kotlin though — Sayse, Jan 24 '20 at 09:08
Does this answer your question? https://stackoverflow.com/questions/1527856/how-can-i-iterate-through-the-unicode-codepoints-of-a-java-string — Hymns For Disco, Jan 24 '20 at 09:23

Willi Mentzel · Accepted Answer · 2020-01-24T12:18:10.160

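You are dealing with a surrogate pair. Kotlin strings are sequences of UTF-16 code units, and UTF-16 encodes every code point above U+FFFF as two such units: a high surrogate followed by a low surrogate. String.length counts code units, which is why you get 2.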

Such a character cannot be represented as a single Char. You can check that by trying to define it as a Char literal:

val someChar = '' // Error: Too many character in character literal ""
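To make the difference between code units and code points concrete, here is a minimal sketch for Kotlin on the JVM. The sample string is an assumption for illustration (U+1F600 written as its two UTF-16 escapes, not the character from the question), and the function names utf16Length and codePointLength are hypothetical helpers, not standard library functions.

```kotlin
// Hypothetical helper: number of UTF-16 code units, i.e. what String.length reports.
fun utf16Length(s: String): Int = s.length

// Hypothetical helper: number of Unicode code points (JVM only).
fun codePointLength(s: String): Int = s.codePointCount(0, s.length)

fun main() {
    // Assumed sample: U+1F600 spelled out as its high and low surrogate.
    val s = "\uD83D\uDE00"

    println(utf16Length(s))         // 2
    println(codePointLength(s))     // 1
    println(s[0].isHighSurrogate()) // true
    println(s[1].isLowSurrogate())  // true
}
```

codePointLength corresponds to what Python's len() reports for the same string.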

So how do you count those properly? Kotlin's standard library has a function, hasSurrogatePairAt, that reports whether a surrogate pair starts at a given index. You could wrap it in an extension function like this:

// Counts the indices at which a high surrogate is immediately followed by a low surrogate.
fun String.countSurrogatePairs() = indices.count {
    hasSurrogatePairAt(it)
}

Usage:

println("".countSurrogatePairs()) // 1
println("".countSurrogatePairs()) // 2

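Python 3 already handles this for you: a str is a sequence of Unicode code points rather than UTF-16 code units, so len() counts the character once.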
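As for iterating character by character, one possible approach on the JVM is to walk the string by code point instead of by Char. This is a minimal sketch, not the only way to do it: codePointStrings is a hypothetical extension name, the sample string is an assumption, and it relies on the JVM-only codePointAt, Character.toChars and Character.charCount.

```kotlin
// Hypothetical helper (not in the standard library): returns one String per
// Unicode code point, so a surrogate pair stays together as a single element.
fun String.codePointStrings(): List<String> {
    val result = mutableListOf<String>()
    var i = 0
    while (i < length) {
        val codePoint = codePointAt(i)          // reads the full pair if one starts at i
        result.add(String(Character.toChars(codePoint)))
        i += Character.charCount(codePoint)     // advance by 1 or 2 Chars
    }
    return result
}

fun main() {
    // Assumed sample: 'a', U+1F600 (as two UTF-16 escapes), 'b'.
    val s = "a\uD83D\uDE00b"
    for (ch in s.codePointStrings()) {
        println("$ch uses ${ch.length} UTF-16 code unit(s)")
    }
}
```

Each element is a String rather than a Char because a code point above U+FFFF does not fit into one Char. Note that this iterates code points, not user-perceived characters: combining sequences still come out as several elements.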
