What does it mean that string and character comparisons in Swift are not locale-sensitive?

Question

I have started learning Swift and I am curious: what does it mean that string and character comparisons in Swift are not locale-sensitive? Does it mean that all characters are stored in Swift as UTF-8 characters?

Martin R · Accepted Answer · 2017-04-09T08:11:36.967

29

(All code examples updated for Swift 3 now.)

Comparing Swift strings with < does a lexicographical comparison based on the so-called "Unicode Normalization Form D" (which can be computed with decomposedStringWithCanonicalMapping).

For example, the decomposition of

"ä" = U+00E4 = LATIN SMALL LETTER A WITH DIAERESIS

is the sequence of two Unicode code points

U+0061,U+0308 = LATIN SMALL LETTER A + COMBINING DIAERESIS
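
As a quick check of this decomposition, here is a minimal sketch, assuming Foundation is available (it provides decomposedStringWithCanonicalMapping). The constant names are only illustrative. It also shows that == treats both forms as equal, because Swift compares strings by canonical equivalence:

```swift
import Foundation

// "ä" as a single precomposed code point (U+00E4).
let precomposed = "\u{E4}"
// Its canonical decomposition (Normalization Form D).
let decomposed = precomposed.decomposedStringWithCanonicalMapping

// Collect the code point values of each form.
let precomposedScalars = precomposed.unicodeScalars.map { $0.value }
let decomposedScalars = decomposed.unicodeScalars.map { $0.value }

print(precomposedScalars) // [228], i.e. U+00E4
print(decomposedScalars)  // [97, 776], i.e. U+0061, U+0308

// Both forms are canonically equivalent, so String's == treats them as equal.
print(precomposed == decomposed) // true
```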

For demonstration purposes, I have written a small String extension which dumps the contents of the String as an array of Unicode code points:

extension String {
    var unicodeData : String {
        return self.unicodeScalars.map {
            String(format: "%04X", $0.value)
            }.joined(separator: ",")
    }
}

Now let's take some strings and sort them with <:

let someStrings = ["ǟ", "ä", "ã", "a", "ă", "b"].sorted()
print(someStrings)
// ["a", "ã", "ă", "ä", "ǟ", "b"]

and dump the Unicode code points of each string (in original and decomposed form) in the sorted array:

for str in someStrings {
    print("\(str)  \(str.unicodeData)  \(str.decomposedStringWithCanonicalMapping.unicodeData)")
}

The output

a  0061  0061
ã  00E3  0061,0303
ă  0103  0061,0306
ä  00E4  0061,0308
ǟ  01DF  0061,0308,0304
b  0062  0062

nicely shows that the comparison is done by a lexicographic ordering of the Unicode code points in the decomposed form.

This is also true for strings of more than one character, as the following example shows. With

let someStrings = ["ǟψ", "äψ", "ǟx", "äx"].sorted()

the output of the above loop is

äx  00E4,0078  0061,0308,0078
ǟx  01DF,0078  0061,0308,0304,0078
ǟψ  01DF,03C8  0061,0308,0304,03C8
äψ  00E4,03C8  0061,0308,03C8

which means that

"äx" < "ǟx", but "äψ" > "ǟψ"

(which was at least unexpected for me).
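
To make the rule explicit, the comparison described above can be reproduced by hand. The following is a minimal sketch, not the standard library's implementation: decomposedScalars and precedesInNFD are hypothetical helper names, and Foundation is assumed for decomposedStringWithCanonicalMapping. It decomposes both strings and compares the resulting code points lexicographically:

```swift
import Foundation

// Hypothetical helper: the code points of a string in decomposed form (NFD).
func decomposedScalars(_ s: String) -> [UInt32] {
    return s.decomposedStringWithCanonicalMapping.unicodeScalars.map { $0.value }
}

// Hypothetical helper: true if lhs sorts before rhs when the decomposed
// code points are compared lexicographically.
func precedesInNFD(_ lhs: String, _ rhs: String) -> Bool {
    return decomposedScalars(lhs).lexicographicallyPrecedes(decomposedScalars(rhs))
}

let aDiaeresisX = "\u{E4}x"                 // "äx"
let aDiaeresisMacronX = "\u{1DF}x"          // "ǟx"
let aDiaeresisPsi = "\u{E4}\u{3C8}"         // "äψ"
let aDiaeresisMacronPsi = "\u{1DF}\u{3C8}"  // "ǟψ"

// 0061,0308,0078 vs 0061,0308,0304,0078: decided by 0078 < 0304.
print(precedesInNFD(aDiaeresisX, aDiaeresisMacronX))     // true
// 0061,0308,0304,03C8 vs 0061,0308,03C8: decided by 0304 < 03C8.
print(precedesInNFD(aDiaeresisMacronPsi, aDiaeresisPsi)) // true
```

The outcome hinges on the code point that follows the diaeresis: x (U+0078) is smaller than COMBINING MACRON (U+0304), while ψ (U+03C8) is larger.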

Finally, let's compare this with a locale-sensitive ordering, for example Swedish:

let locale = Locale(identifier: "sv") // svenska
var someStrings = ["ǟ", "ä", "ã", "a", "ă", "b"]
someStrings.sort {
    $0.compare($1, locale: locale) == .orderedAscending
}

print(someStrings)
// ["a", "ă", "ã", "b", "ä", "ǟ"]

As you see, the result is different from the Swift < sorting.

edited Apr 09 '17 at 08:11

answered Sep 10 '14 at 21:07

Martin R


Addition/details ("official" ref.: from open source): from the [String.swift source code](https://github.com/apple/swift/blob/master/stdlib/public/core/String.swift) we can see that e.g. the `<` operator for `String` is defined as `lhs._compareString(rhs) < 0` (which itself uses `_swift_stdlib_unicode_compare_utf8_utf8`), which we can track via https://github.com/apple/swift/blob/master/stdlib/public/stubs/UnicodeNormalization.cpp to `ucol_strcollIter` (see `MakeRootCollator` for collator settings) from ICU lib; i.e., using the [unicode collation algorithm](http://unicode.org/reports/tr10/). – dfrib Jul 12 '16 at 14:47
... ([link to relevant ICU lib](http://icu-project.org/apiref/icu4c/ucol_8h.html)) – dfrib Jul 12 '16 at 14:49
@dfri: Thanks for providing the links, much appreciated. I *think* it was also mentioned somewhere in Apple's documentation, but I cannot find it anymore. – Martin R Jul 12 '16 at 16:41
Happy to help. I also recall I've seen some mention of this in the docs, but had no success when I tried to find it earlier today. Seems like the Swift reference docs change quicker than the stdlib itself, and only by ninja-edits :) – dfrib Jul 12 '16 at 20:06
Your last example confused me as I am Swedish and the order should be `[a, b, å, ä]` (not sure about ă and ã but I guess they should be between a and b) since å and ä are separate letters that come after z in the Swedish alphabet. After some time I realised that you entered locale identifier "se", which is the country code for Sweden but the language code for Northern Sami. The correct language code for Swedish is "sv" :) – LoPoBo Apr 07 '17 at 12:10
@LoPoBo: I apologize to all Swedish speaking people! With language code "sv" the result is `["a", "ă", "ã", "b", "ä", "ǟ"]` – does that look correct? I will update the answer tonight (and update it for Swift 3 as well). Thanks for your feedback, much appreciated! – Martin R Apr 07 '17 at 12:27
Yes that seems correct. When I wrote `[a, b, å, ä]` I had mistaken `ǟ` for `å` (they look quite similar in the code font). Only a, b and ä are regular letters of the Swedish alphabet, but the order of the other ones seems logical. – LoPoBo Apr 07 '17 at 13:18

Does the `<` operator guarantee transitivity (a < b and b < c implies a < c)? I ran into an example where this breaks, and I posted the example here https://stackoverflow.com/questions/46230471/string-comparison-in-swift-is-not-transitive . From your explanation I think it should guarantee transitivity but obviously this is not the case. – Pinch Sep 15 '17 at 02:19

score 1 · Answer 2 · answered Sep 07 '14 at 19:36

Changing the locale can change the alphabetical order: for example, a case-sensitive comparison can appear case-insensitive because of the locale or, more generally, two strings can sort in a different order.

answered Sep 07 '14 at 19:36

Miro Lehtonen


Does it mean that Swift stores its own table of all possible characters or it uses any standard like Unicode, etc? – Dmytro Plekhotkin Sep 07 '14 at 20:01

No, it doesn't mean that. It means the same as setting LC_ALL=C which means that we're comparing pure byte-values. – Miro Lehtonen Sep 07 '14 at 21:03

score 1 · Answer 3 · edited May 23 '17 at 10:31

Lexicographical ordering and locale-sensitive ordering can be different. You can see an example of it in this question: Sorting scala list equivalent to C# without changing C# order

In that specific case the locale-sensitive ordering placed _ before 1, whereas in a lexicographical ordering it's the opposite.

Swift comparison uses lexicographical ordering.
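
As a small illustration of the point about _ and 1, here is a sketch in Swift; the sample strings are made up for this example. Both characters are plain ASCII, so < orders them by code point, with 1 (U+0031) before _ (U+005F):

```swift
// Sample identifiers, made up for this illustration.
let names = ["a_", "a1"]

// Sorting with < places "1" (U+0031) before "_" (U+005F).
let sortedNames = names.sorted()
print(sortedNames) // ["a1", "a_"]

print("1" < "_") // true
```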

edited May 23 '17 at 10:31

Community


answered Sep 07 '14 at 19:49

Gabriele Petronella


Does lexicographical ordering mean alphabetical order? What about characters from different alphabets (each country has its own alphabet): how does it know which characters come first? – Dmytro Plekhotkin Sep 07 '14 at 19:57
@GabrielePetronella: That's what I thought as well, but all the expressions `"a" < "ä"`, `"ä" < "b"` and `ClosedInterval("a", "b").contains("ä")` return true in my test project. – Martin R Sep 07 '14 at 20:45
