Swift abstracts over string indices for several reasons. The primary intent, as far as I can tell, is to stop people from thinking of them as just integers. Under the hood, they are, but they behave counter to people's initial expectations.
ASCII as a "default"
Our expectations of String encoding are usually pretty English-centric. ASCII is usually the first character encoding people are taught, often with some pretense that it's somehow the most popular, most standard, etc.
The issue is, most users aren't Americans. They're western Europeans who need lots of different accents on their Latin alphabets, or eastern Europeans who want Cyrillic alphabets, or Chinese users who have a bunch of different characters (over 74,000!) that they need to be able to write. ASCII was never intended to be an international standard for encoding all languages. The American Standards Association created ASCII for encoding characters relevant to the US market. Other countries made their own character encodings for their own needs.
The advent of Unicode
Having regional character encodings worked, until international communication with computers became more prevalent. These fragmented character encodings weren't interoperable with each other, causing all kinds of garbled text and user confusion. There needed to be a new standard to unify them and allow standardized encoding, world-wide.
Thus, Unicode was invented as the one ring to rule them all. A single code table, containing all of the characters of all of the languages, with plenty of room for future expansion.
1 byte per character
In ASCII, there are 128 possible characters. Each character in a string is encoded as a single 8-bit byte. This means that for an n-character string, you have exactly n bytes. Subscripting to get the i-th character is a matter of simple pointer arithmetic, just like any array subscripting:
address_of_element_i = base_address + (size_of_each_element * i)
With size_of_each_element being just 1 (byte), this reduces further to just base_address + i. This was really fast, and it worked.
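Here's a minimal sketch of what that constant-time lookup looks like, using Swift's UTF-8 view as a stand-in for a raw byte buffer:
// Sketch: with one byte per ASCII character, the i-th character
// is just the i-th byte – a constant-time array lookup.
let ascii = "Hello, World!"
let bytes = Array(ascii.utf8)          // [UInt8]: 13 bytes, one per character
let i = 7
let byte = bytes[i]                    // base_address + i, O(1)
print(Character(UnicodeScalar(byte)))  // "W"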
This 1-byte-per-character quality of ASCII informed the API design of the string types in many (most?) programming languages' standard libraries. Even though ASCII is the wrong choice for a "default" encoding (and has been for decades), by the time Unicode became ubiquitous, the damage was done.
Extended grapheme clusters
What users perceive as characters are called "extended grapheme clusters" in Unicode. They're a base character, optionally followed by any number of combining characters. This smashed the "1 character is 1 byte" assumption that many languages were built on.
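As a quick sketch of what that means in practice, a single user-perceived "é" can be one character, two code points, and three bytes, all at once:
let e = "e\u{301}"              // "e" + U+0301 COMBINING ACUTE ACCENT
print(e)                        // é
print(e.count)                  // 1 – one extended grapheme cluster (Character)
print(e.unicodeScalars.count)   // 2 – two Unicode code points
print(e.utf8.count)             // 3 – three bytes in UTF-8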
Thinking of characters as bytes is broken in the Unicode world. Not "oh, it's good enough, we'll worry about it when we expand to international markets", but absolutely and totally unworkable. Most users don't speak English. English users use emoji. The assumptions built up from ASCII just don't work anymore. Take Python 2.7, for example; this works fine:
>>> s = "Hello, World!"
>>> print(s)
Hello, World!
>>> print(s[7])
W
And this does not:
>>> s = ""
>>> print(s)
>>> print([2])
[2]
>>> print(s[2])
�
In Python 3, a breaking change was introduced: indices now represent code points, not bytes. So now the code above works "as expected", printing 世. But it's still not sufficient: multi-code-point characters are still broken, for example:
>>> s = "AZ"
>>> print(s[0])
A
>>> print(s[1])
👨
>>> print(s[2]) # Zero width joiner
>>> print(s[3])
👨
>>> print(s[4])
>>> print(s[5])
👦
>>> print(s[6])
>>> print(s[7])
👦
>>> print(s[8])
Z
>>> print(s[9]) # Last index
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: string index out of range
Swift handles this trivially:
1> let s = "AZ"
s: String = "AZ"
2> s[s.index(s.startIndex, offsetBy: +0)]
$R0: Character = "A"
3> s[s.index(s.startIndex, offsetBy: +1)]
$R1: Character = ""
4> s[s.index(s.startIndex, offsetBy: +2)]
$R2: Character = "Z"
Trade-offs
Subscripting by characters is slow in Unicode. You're forced to walk the string from the beginning, applying the grapheme-breaking rules as you go, counting characters until you reach the desired index. This is an O(n) process, unlike the O(1) subscripting of the ASCII case.
If this walk were hidden behind an integer subscript operator, code like:
for i in 0..<str.count {
print(str[i])
}
might look like it's O(str.count) (after all, "there's only one for loop", right?!), but it's actually O(str.count²), because each str[i] operation hides a linear walk through the string, which happens over and over again.
The Swift String API
Swift's String API is trying to steer people away from direct indexing, and toward alternate patterns that don't involve manual index math (see the short sketch after this list), such as:
- Using String.prefix/String.suffix for chopping the start or end off a string to get a slice
- Using String.map to transform all the characters in the string
- Using other built-ins for uppercasing, lowercasing, reversing, trimming, etc.
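Here's a brief sketch of those index-free patterns in action (the strings are just illustrative):
let s = "Hello, World!"

print(s.prefix(5))                             // "Hello" – a slice, no index math
print(s.suffix(6))                             // "World!"
print(String(s.map { $0 == "o" ? "0" : $0 }))  // "Hell0, W0rld!"
print(s.uppercased())                          // "HELLO, WORLD!"
print(String(s.reversed()))                    // "!dlroW ,olleH"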
Swift's String API isn't complete yet, and there's a lot of desire and intent to improve its ergonomics.
However, much of the string-processing code people are used to writing is just plain wrong. They might just have never noticed, because they've never tried to use it with a foreign language or with emoji. Swift's String is trying to be correct by default, and to make internationalization mistakes hard to make.