15

I have started learning Swift and I am curious: what does it mean that string and character comparisons in Swift are not locale-sensitive? Does it mean that all characters are stored in Swift as UTF-8 characters?

Martin R
Dmytro Plekhotkin

3 Answers

29

(All code examples updated for Swift 3 now.)

Comparing Swift strings with < does a lexicographical comparison based on the so-called "Unicode Normalization Form D" (which can be computed with decomposedStringWithCanonicalMapping).

For example, the decomposition of

"ä" = U+00E4 = LATIN SMALL LETTER A WITH DIAERESIS

is the sequence of two Unicode code points

U+0061,U+0308 = LATIN SMALL LETTER A + COMBINING DIAERESIS
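As a minimal sketch of that decomposition (assuming Foundation is available, since decomposedStringWithCanonicalMapping comes from there), the following prints the scalar values of both forms and checks that Swift treats them as equal. The variable names are only illustrative.

```swift
import Foundation

// Precomposed "ä" (one scalar) and its canonical decomposition (two scalars).
let precomposed = "\u{00E4}"
let decomposed = precomposed.decomposedStringWithCanonicalMapping

// Scalar values of each form.
let precomposedScalars = precomposed.unicodeScalars.map { $0.value }
let decomposedScalars = decomposed.unicodeScalars.map { $0.value }

print(precomposedScalars.map { String($0, radix: 16, uppercase: true) })  // ["E4"]
print(decomposedScalars.map { String($0, radix: 16, uppercase: true) })   // ["61", "308"]

// Swift's == treats canonically equivalent strings as equal.
print(precomposed == decomposed)  // true
```

The two strings hold different scalars but compare equal, which is why the decomposed form is the natural basis for ordering as well.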

For demonstration purposes, I have written a small String extension which dumps the contents of the String as an array of Unicode code points:

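import Foundation // needed for String(format:)

extension String {
    // Comma-separated Unicode scalar values as four-digit hex numbers.
    var unicodeData: String {
        return self.unicodeScalars.map {
            String(format: "%04X", $0.value)
        }.joined(separator: ",")
    }
}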

Now let's take some strings and sort them with <:

let someStrings = ["ǟ", "ä", "ã", "a", "ă", "b"].sorted()
print(someStrings)
// ["a", "ã", "ă", "ä", "ǟ", "b"]

and dump the Unicode code points of each string (in original and decomposed form) in the sorted array:

for str in someStrings {
    print("\(str)  \(str.unicodeData)  \(str.decomposedStringWithCanonicalMapping.unicodeData)")
}

The output

a  0061  0061
ã  00E3  0061,0303
ă  0103  0061,0306
ä  00E4  0061,0308
ǟ  01DF  0061,0308,0304
b  0062  0062

nicely shows that the comparison is done by a lexicographic ordering of the Unicode code points in the decomposed form.
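To make the rule explicit, here is a rough sketch that spells out the comparison by hand instead of relying on <. The helpers nfdScalars and nfdPrecedes are hypothetical names introduced only for this illustration; they are not part of the standard library, and the internals of < may differ between Swift versions.

```swift
import Foundation

// Hypothetical helper (not part of the standard library): the scalar values
// of a string after canonical decomposition (NFD).
func nfdScalars(_ s: String) -> [UInt32] {
    return s.decomposedStringWithCanonicalMapping.unicodeScalars.map { $0.value }
}

// Hypothetical helper: true if `lhs` sorts before `rhs` when the decomposed
// scalar values are compared lexicographically.
func nfdPrecedes(_ lhs: String, _ rhs: String) -> Bool {
    return nfdScalars(lhs).lexicographicallyPrecedes(nfdScalars(rhs))
}

let sortedByNFD = ["ǟ", "ä", "ã", "a", "ă", "b"].sorted(by: nfdPrecedes)
print(sortedByNFD)
// ["a", "ã", "ă", "ä", "ǟ", "b"]
```

Sorting with this explicit predicate gives the same order as the dump above: "a" is a prefix of every accented form, and "b" (0062) only comes after everything that starts with 0061.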

This is also true for strings of more than one character, as the following example shows. With

let someStrings = ["ǟψ", "äψ", "ǟx", "äx"].sorted()

the output of the above loop is

äx  00E4,0078  0061,0308,0078
ǟx  01DF,0078  0061,0308,0304,0078
ǟψ  01DF,03C8  0061,0308,0304,03C8
äψ  00E4,03C8  0061,0308,03C8

which means that

"äx" < "ǟx", but "äψ" > "ǟψ"

(which was at least unexpected for me).
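The flip becomes less surprising when looking at the first position where the decomposed forms differ. The following sketch locates it; firstDifference and nfdScalars are hypothetical helpers written for this illustration only.

```swift
import Foundation

// Hypothetical helper: scalar values after canonical decomposition (NFD).
func nfdScalars(_ s: String) -> [UInt32] {
    return s.decomposedStringWithCanonicalMapping.unicodeScalars.map { $0.value }
}

// Hypothetical helper: the first position at which the decomposed forms
// differ, with the scalar value found there on each side, or nil if one
// form is a prefix of the other.
func firstDifference(_ lhs: String, _ rhs: String) -> (index: Int, lhs: UInt32, rhs: UInt32)? {
    let l = nfdScalars(lhs), r = nfdScalars(rhs)
    for (i, (a, b)) in zip(l, r).enumerated() where a != b {
        return (index: i, lhs: a, rhs: b)
    }
    return nil
}

if let d = firstDifference("äx", "ǟx") {
    // "x" (0x78) is compared against COMBINING MACRON (0x304).
    print(d.index, String(d.lhs, radix: 16), String(d.rhs, radix: 16))  // 2 78 304
}
if let d = firstDifference("äψ", "ǟψ") {
    // "ψ" (0x3c8) is compared against COMBINING MACRON (0x304).
    print(d.index, String(d.lhs, radix: 16), String(d.rhs, radix: 16))  // 2 3c8 304
}
```

In both pairs the second character of the shorter decomposition is compared against the combining macron U+0304: "x" is smaller than it, "ψ" is larger, so the relative order of ä and ǟ depends on what follows them.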

Finally, let's compare this with a locale-sensitive ordering, for example Swedish:

let locale = Locale(identifier: "sv") // svenska
var someStrings = ["ǟ", "ä", "ã", "a", "ă", "b"]
someStrings.sort {
    $0.compare($1, locale: locale) == .orderedAscending
}

print(someStrings)
// ["a", "ă", "ã", "b", "ä", "ǟ"]

As you see, the result is different from the Swift < sorting.

Martin R
  • 2
    Addition/details ("official" ref.: from open source): from the [String.swift source code](https://github.com/apple/swift/blob/master/stdlib/public/core/String.swift) we can see that e.g. the `<` operator for `String` is defined as `lhs._compareString(rhs) < 0` (which itself uses `_swift_stdlib_unicode_compare_utf8_utf8`), which we can track via https://github.com/apple/swift/blob/master/stdlib/public/stubs/UnicodeNormalization.cpp to `ucol_strcollIter` (see `MakeRootCollator` for collator settings) from ICU lib; i.e., using the [unicode collation algorithm](http://unicode.org/reports/tr10/). – dfrib Jul 12 '16 at 14:47
  • ... ([link to relevant ICU lib](http://icu-project.org/apiref/icu4c/ucol_8h.html)) – dfrib Jul 12 '16 at 14:49
  • @dfri: Thanks for providing the links, much appreciated. I *think* it was also mentioned somewhere in Apple's documentation, but I cannot find it anymore. – Martin R Jul 12 '16 at 16:41
  • Happy to help. I also recall I've seen some mention of this in the docs, but had no success when I tried to find it earlier today. Seems like the Swift reference docs change quicker than the stdlib itself, and only by ninja-edits :) – dfrib Jul 12 '16 at 20:06
  • Your last example confused me as I am Swedish and the order should be `[a, b, å, ä]` (not sure about ă and ã but I guess they should be between a and b) since å and ä are separate letters that come after z in the Swedish alphabet. After some time I realised that you entered locale identifier "se", which is the country code for Sweden but the language code for Northern Sami. The correct language code for Swedish is "sv" :) – LoPoBo Apr 07 '17 at 12:10
  • @LoPoBo: I apologize to all Swedish speaking people! With language code "sv" the result is `["a", "ă", "ã", "b", "ä", "ǟ"]` – does that look correct? I will update the answer tonight (and update it for Swift 3 as well). Thanks for your feedback, much appreciated! – Martin R Apr 07 '17 at 12:27
  • Yes that seems correct. When I wrote `[a, b, å, ä]` I had mistaken `ǟ` for `å` (they look quite similar in the code font). Only a, b and ä are regular letters of the Swedish alphabet, but the order of the other ones seems logical. – LoPoBo Apr 07 '17 at 13:18
  • 1
    Does the `<` operator guarantee transitivity (a < b and b < c implies a < c)? I ran into an example where this breaks, and I posted the example here https://stackoverflow.com/questions/46230471/string-comparison-in-swift-is-not-transitive . From your explanation I think it should guarantee transitivity but obviously this is not the case. – Pinch Sep 15 '17 at 02:19
1

Changing the locale can change the alphabetical order. For example, a case-sensitive comparison can appear case-insensitive because of the locale or, more generally, two strings can sort in a different relative order.
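A small sketch of the case example, assuming Foundation's compare(_:locale:) and an "en" locale chosen purely for illustration. The locale-sensitive result depends on the locale data available on the platform, so no output is shown for it.

```swift
import Foundation

// Plain `<` on ASCII letters follows code point order, so every uppercase
// letter precedes every lowercase one.
let plainOrder = ["b", "A", "a", "B"].sorted()
print(plainOrder)
// ["A", "B", "a", "b"]

// The same strings sorted with a locale-sensitive comparison. The exact
// result depends on the locale data available on the platform.
let locale = Locale(identifier: "en")  // assumption: any locale could be used here
let localeOrder = ["b", "A", "a", "B"].sorted {
    $0.compare($1, locale: locale) == .orderedAscending
}
print(localeOrder)
```

If the second array groups "a" with "A" and "b" with "B", the comparison looks case-insensitive at first glance even though it still distinguishes the two cases.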

Miro Lehtonen
1

Lexicographical ordering and locale-sensitive ordering can be different. You can see an example of it in this question: Sorting scala list equivalent to C# without changing C# order

In that specific case the locale-sensitive ordering placed _ before 1, whereas in a lexicographical ordering it's the opposite.

Swift comparison uses lexicographical ordering.
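For the two characters from that question, a quick sketch in Swift: the < part only relies on the code point values, while the locale-sensitive part assumes Foundation's compare(_:locale:) with an "en" locale picked for illustration, and its result may vary by platform.

```swift
import Foundation

// "1" is U+0031 and "_" is U+005F, so ordering by code point puts "1" first.
let lexicographic = ["_", "1"].sorted()
print(lexicographic)
// ["1", "_"]

// Locale-sensitive comparison of the same two strings; the outcome depends
// on the collation rules of the chosen locale.
let locale = Locale(identifier: "en")  // assumption: illustrative locale
let result = "_".compare("1", locale: locale)
print(result == .orderedAscending ? "_ sorts before 1" : "_ does not sort before 1")
```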

Community
Gabriele Petronella
  • Does lexicographical ordering mean alphabetical order? What about characters from different alphabets (each country has its own alphabet): how does it know which characters come first? – Dmytro Plekhotkin Sep 07 '14 at 19:57
  • @GabrielePetronella: That's what I thought as well, but all the expressions `"a" < "ä"`, `"ä" < "b"` and `ClosedInterval("a", "b").contains("ä")` return true in my test project. – Martin R Sep 07 '14 at 20:45