5

I've read the docs about String and Unicode in Swift 5, but I couldn't understand why we can't get a Character from a String like this:

let someString = "ab🍓"
let oneCharacter = someString[2] // Error

Why should we use a more complex way of getting a Character?

let strawberryIndex = someString.index(someString.startIndex, offsetBy: 2) // String.Index type
someString[strawberryIndex] // Character("🍓")

What's the point of using type String.Index?

Boann
Kirill Semenov

4 Answers

2

First, you can't use Int as an index for a string. The interface requires String.Index.

Why? Swift strings use Unicode, not ASCII. The unit of a Swift string is the Character, which is a "grapheme cluster". A character can consist of multiple Unicode code points, and each code point can take 1 to 4 bytes in UTF-8.

Now let's say you have a 10-megabyte string and you search it for the substring "Wysteria". Would you want the search to return the character number where the substring starts? If it starts at character 123,456, then to find the same spot again we would have to start at the beginning of the string and analyze 123,456 characters to get there. That is madly inefficient.

Instead we get a String.Index, which is something that allows Swift to locate that substring again quickly. It is most likely a byte offset, so the position can be accessed very quickly.

Now, adding 1 to that byte offset is nonsense, because you don't know how long the first character is. (It's quite possible that Unicode has a character that looks like the ASCII 'W' but takes more bytes.) So you need to call a function that returns the index of the next character.

You can write code that returns the second Character of a string cheaply enough; returning the one millionth Character takes significant time. Swift simply doesn't let an innocent-looking subscript hide something enormously inefficient.
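To make that walk concrete, here is a small sketch (the example string is mine, not from the answer) of what "a function that returns the index of the next character" looks like in practice, using index(after:):

```swift
// Sketch: advancing through a string one Character at a time.
// index(after:) inspects the bytes at the current position, works out
// how long that grapheme cluster is, and returns the position past it.
let demo = "ab🍓c"

var idx = demo.startIndex
var collected: [Character] = []
while idx < demo.endIndex {
    collected.append(demo[idx])
    idx = demo.index(after: idx) // steps 1 byte for "a", 4 bytes for "🍓"
}
print(collected) // ["a", "b", "🍓", "c"]
```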

gnasher729
2

Swift abstracts over string indices for several reasons. The primary intent, as far as I can tell, is to stop people from thinking of them as just integers. Under the hood they are, but they behave counter to people's initial expectations.

ASCII as a "default"

Our expectations of string encoding are usually pretty English-centric. ASCII is usually the first character encoding people are taught, and usually with some pretense that it's somehow the most popular, or most standard, etc.

The issue is, most users aren't Americans. They're western Europeans who need lots of different accents on their Latin alphabets, or eastern Europeans who want Cyrillic alphabets, or Chinese users who have a huge set of characters (over 74,000!) that they need to be able to write. ASCII was never intended to be an international standard for encoding all languages. The American Standards Association created ASCII to encode the characters relevant to the US market. Other countries made their own character encodings for their own needs.

The advent of Unicode

Having regional character encodings worked, until international communication between computers became more prevalent. These fragmented character encodings weren't interoperable with each other, causing all kinds of garbled text and user confusion. There needed to be a new standard to unify them and allow standardized encoding, world-wide.

Thus, Unicode was invented as the one ring to rule them all. A single code table, containing all of the characters of all of the languages, with plenty of room for future expansion.

1 byte per character

In ASCII, there are 128 possible characters. Each character in a string is encoded as a single 8-bit byte. This means that an n-character string takes exactly n bytes. Subscripting to get the ith character is a matter of simple pointer arithmetic, just like any array subscripting:

address_of_element_i = base_address + (size_of_each_element * i)

With size_of_each_element being just 1 (byte), this reduces further to just base_address + i. This was really fast, and it worked.
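A quick sketch of that arithmetic (the example is mine), using a byte array standing in for an ASCII string buffer:

```swift
// Sketch: fixed-width ASCII makes subscripting pure pointer arithmetic.
let asciiBytes = Array("Hello".utf8) // [72, 101, 108, 108, 111]

// Element i lives exactly i bytes past the base address: an O(1) lookup.
let pos = 1
let byte = asciiBytes[pos]
print(Character(UnicodeScalar(byte))) // "e"
```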

This 1-byte-per-character quality of ASCII informed the API design of the string types in many (most?) programming languages' standard libraries. Even though ASCII is the wrong choice for a "default" encoding (and has been for decades), by the time Unicode became ubiquitous, the damage was done.

Extended grapheme clusters

What users perceive as characters are called "extended grapheme clusters" in Unicode. They're a base character, optionally followed by any number of continuation characters (combining marks, zero-width joiners, and so on). This smashed the "1 character is 1 byte" assumption that many languages were built on.
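Swift's string views make this concrete. A quick sketch, using a family emoji (my stand-in for any multi-code-point cluster):

```swift
// One user-perceived character can be many code points and even more bytes.
let family = "A👨‍👨‍👧‍👦Z"

print(family.count)                // 3  Characters (grapheme clusters)
print(family.unicodeScalars.count) // 9  code points: A, four people, three zero-width joiners, Z
print(family.utf8.count)           // 27 UTF-8 bytes
```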

Thinking of characters as bytes is broken in the Unicode world. Not "oh, it's good enough, we'll worry about it when we expand to international markets", but absolutely and totally unworkable. Most users don't speak English, and English-speaking users use emojis. The assumptions built up from ASCII just don't work anymore. Take Python 2.7, for example. This works fine:

>>> s = "Hello, World!"
>>> print(s)
Hello, World!
>>> print(s[7]) 
W

And this does not:

>>> s = "ab🍓"
>>> print(s)
ab🍓
>>> print(s[2])
�

In Python 3, a breaking change was introduced: string indices now represent code points, not bytes. So now the code above works "as expected", printing the whole character. But it's still not sufficient. Multi-code-point characters are still broken, for example:

>>> s = "A👨‍👨‍👧‍👦Z"
>>> print(s[0])
A
>>> print(s[1])
👨
>>> print(s[2]) # Zero width joiner

>>> print(s[3])
👨
>>> print(s[4]) # Zero width joiner

>>> print(s[5])
👧
>>> print(s[6]) # Zero width joiner

>>> print(s[7])
👦
>>> print(s[8])
Z
>>> print(s[9]) # Out of range
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range

Swift handles this trivially:

  1> let s = "A👨‍👨‍👧‍👦Z"
s: String = "A👨‍👨‍👧‍👦Z"
  2> s[s.index(s.startIndex, offsetBy: +0)]
$R0: Character = "A"
  3> s[s.index(s.startIndex, offsetBy: +1)]
$R1: Character = "👨‍👨‍👧‍👦"
  4> s[s.index(s.startIndex, offsetBy: +2)]
$R2: Character = "Z"

Trade-offs

Subscripting by characters is slow in Unicode. You're forced to walk the string from the beginning, applying the grapheme-breaking rules as you go, counting characters until you reach the desired offset. This is an O(n) process, unlike the O(1) subscript in the ASCII case.

If this cost were hidden behind a subscript operator, code like:

for i in 0..<str.count {
    print(str[i])
}

might look like it's O(str.count) (after all, "there's only one for loop", right?!), but it's actually O(str.count²), because each str[i] operation hides a linear walk through the string, which happens over and over again.
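The fix is to not subscript at all: iterate the Characters directly, which applies the grapheme-breaking rules in a single pass. A sketch, using my own example string:

```swift
let str = "ab🍓c"

// One O(n) pass: the loop carries its position along as it goes.
for ch in str {
    print(ch)
}

// Need an offset too? enumerated() still walks the string only once.
for (offset, ch) in str.enumerated() {
    print(offset, ch)
}
```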

The Swift String API

Swift's String API is trying to force people away from direct indexing, and toward alternate patterns that don't involve manual indexing, such as:

  1. String.prefix/String.suffix for chopping the start or end off a string to get a slice
  2. Using String.map to transform all the characters in the string
  3. and using other built-ins for uppercasing, lowercasing, reversing, trimming etc.
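A sketch of those patterns in action (example string mine):

```swift
let word = "ab🍓c"

// 1. Slices from either end, no manual indexing:
let head = word.prefix(2)   // "ab" (a Substring)
let tail = word.suffix(2)   // "🍓c"

// 2. Transform every Character:
let shouted = word.map { String($0).uppercased() }.joined() // "AB🍓C"

// 3. Other built-ins, e.g. reversing by Character:
let backwards = String(word.reversed()) // "c🍓ba"

print(head, tail, shouted, backwards)
```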

Swift's String API isn't fully complete yet. There's a lot of desire/intent to improve its ergonomics.

However, much of the string-processing code people are used to writing is just plain wrong. They might just have never noticed, because they've never tried it in a foreign language or with emojis. String is trying to be correct by default, and to make internationalization mistakes hard.

Alexander
1

As you can see from the links/information others are providing (and How does String.Index work in Swift), it's about performance.

RandomAccessCollection makes the guarantee that it "can move indices any distance and measure the distance between indices in O(1) time." String can't do that.

You can just do the following, and it will work, but it will break that contract:

extension RandomAccessCollection {
  subscript(position: Int) -> Element {
    self[index(startIndex, offsetBy: position)]
  }
}
extension Substring: RandomAccessCollection { }
extension String: RandomAccessCollection { }
"ab🍓"[2] // "🍓"

I do recommend something like this, however:

public extension Collection {
  /// - Complexity: O(`position`)
  subscript(startIndexOffsetBy position: Int) -> Element {
    self[index(startIndex, offsetBy: position)]
  }
}
"ab🍓"[startIndexOffsetBy: 2] // "🍓"