
I'm iterating over a String in Kotlin and noticed that Kotlin views a certain Chinese character as having length 2. The same character has length 1 in Python 3.8.

Kotlin:

>>> "".length
2

Python:

>>> len("")
1

Why is that the case and how can I iterate over the string in Kotlin character by character?

Mr-Pepe
  • The "why" sounds like it's because it's a double-byte character. I've never used Kotlin though – Sayse Jan 24 '20 at 09:08
  • Does this answer your question? https://stackoverflow.com/questions/1527856/how-can-i-iterate-through-the-unicode-codepoints-of-a-java-string – Hymns For Disco Jan 24 '20 at 09:23

1 Answer


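You are dealing with a surrogate pair. A Kotlin String is a sequence of UTF-16 code units, and UTF-16 encodes every character outside the Basic Multilingual Plane (code points above U+FFFF) as two code units, called a surrogate pair. length counts code units, which is why it returns 2 here.

Such a character cannot be represented as a single Char. You can check this by trying to define it as a Char literal: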

val someChar = '' // Error: Too many characters in a character literal ""
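To make the two code units visible, here is a minimal sketch. The helper name utf16Units is hypothetical, and the sample string uses U+20000 (written as escapes) as an assumed stand-in, since any character above U+FFFF behaves the same way.

```kotlin
// Hypothetical helper: renders each UTF-16 code unit of a string as hex.
fun utf16Units(text: String): List<String> =
    text.map { "0x" + it.code.toString(16).uppercase().padStart(4, '0') }

fun main() {
    // Assumed sample: U+20000, written as its high and low surrogate escapes.
    val text = "\uD840\uDC00"

    println(text.length)                // 2
    println(utf16Units(text))           // [0xD840, 0xDC00]
    println(text[0].isHighSurrogate())  // true
    println(text[1].isLowSurrogate())   // true
}
```

Each half of the pair is a valid Char on its own, but neither half is a complete character.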

So how do you count them properly? Kotlin's standard library provides hasSurrogatePairAt, which you can wrap in an extension function like this:

// Counts the indices where a high surrogate is immediately followed by a low surrogate.
fun String.countSurrogatePairs() = indices.count { hasSurrogatePairAt(it) }

Usage:

println("".countSurrogatePairs()) // 1
println("".countSurrogatePairs()) // 2

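Python 3 does not show this behavior because its strings are sequences of Unicode code points, so len counts code points rather than UTF-16 code units.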
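To iterate character by character, as the question asks, one option is to walk the string and take two Chars wherever hasSurrogatePairAt reports a pair. This is only a sketch: codePointStrings is a hypothetical helper name, and the sample string again uses U+20000 as an assumed stand-in character.

```kotlin
// Hypothetical helper: splits a string into one substring per code point.
// A surrogate pair stays together; every other Char becomes its own element.
fun String.codePointStrings(): List<String> {
    val result = mutableListOf<String>()
    var i = 0
    while (i < length) {
        // A pair occupies two Chars, anything else occupies one.
        val width = if (hasSurrogatePairAt(i)) 2 else 1
        result.add(substring(i, i + width))
        i += width
    }
    return result
}

fun main() {
    // Assumed sample: 'a', U+20000 (as surrogate escapes), 'b'.
    val text = "a\uD840\uDC00b"

    println(text.length)                   // 4
    println(text.codePointStrings().size)  // 3
    for (ch in text.codePointStrings()) {
        println(ch.length)                 // 1, then 2, then 1
    }
}
```

On the JVM, Java's codePoints() method should also be callable on a Kotlin String and yields the code points as an IntStream. The loop above avoids that dependency and treats an unpaired surrogate as a single element.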

Willi Mentzel