3

In the swift doc, they say they use String.Index to index strings, as different characters can take a different amount of memory.

But I saw a lot of people transforming a String into an array var a = Array(s) so they can index by int instead of String.Index (which is definitely easier)

So I wanted to test by myself if it's exactly the same for all unicode character:

let cafeA = "caf\u{E9}" // eAcute
let cafeB = "caf\u{65}\u{301}" // combinedEAcute

let arrayCafeA = Array(cafeA)
let arrayCafeB = Array(cafeB)

print("\(cafeA) is \(cafeA.count) character \(arrayCafeA.count)")
print("\(cafeB) is \(cafeB.count) character \(arrayCafeB.count)")
print(cafeA == cafeB)

print("- A scalar")
for scalar in cafeA.unicodeScalars {
    print(scalar.value)
}
print("- B scalar")
for scalar in cafeB.unicodeScalars {
    print(scalar.value)
}

And here is the output :

café is 4 character 4
café is 4 character 4
true
- A scalar
99
97
102
233
- B scalar
99
97
102
101
769

And sure enough, as mentioned in the doc strings are just an array of Character, and then the grapheme cluster is down within the Character object, so why don't they indexed it by int ? what's the point of creating/using String.Index actually ?

  • 1
    "strings are just an array of Character" they're actually not. They're a buffer that stores grapheme clusters, whose variable-width `Character` boundaries aren't known without walking through the buffer from the start. – Alexander Jan 18 '18 at 16:55

1 Answers1

3

In a String, the byte representation is packed, so there's no way to know where the character boundaries are without traversing the whole string from the start.

When converting to an array, this is traversal is done once, and the result is an array of characters that are equidistantly spaced out in memory, which is what allows constant time subscripting by an Int index. Importantly, the array is preserved, so many subscripting operations can be done upon the same array, requiring only one traversal of the String's bytes, for the initial unpacking.

It is possible extend String with a subscript that indexes it by an Int, and you see it often come up on SO, but that's ill advised. The standard library programmers could have added it, but they purposely chose not to, because it obscures the fact that every indexing operation requires a separate traversal of the String's bytes, which is O(string.count). All of a sudden, innocuous code like this:

for i in string.indices {
    print(string[i]) // Looks O(1), but is actually O(string.count)!
}

becomes quadratic.

Alexander
  • 59,041
  • 12
  • 98
  • 151
  • Ok thank you now it’s clear ! Just for curiosity, if the string is packed, how does it know how and when to cluster unicode scalars together ? Does it depends on the chararacter themselves (eg cute accent will always be clustered with the e) ? –  Jan 18 '18 at 19:14
  • 1
    @Julien That... That is a whole can of worms. It's called a segmentation algorithm, and you can get more info on it here: http://www.unicode.org/reports/tr29/ – Alexander Jan 18 '18 at 19:27
  • 1
    @Julien In your example, yes, the "eg cute accent will always be clustered with the e", but there are a ***lot*** of ***really*** complicated edge cases – Alexander Jan 18 '18 at 19:27